genomic locus

2017-12-15 Thread Gene Selkov
Greetings everyone,

I need a data type to represent genomic positions, which will consist of a
string and a pair of integers with interval logic and access methods. Sort
of like my seg type, but more straightforward.

I noticed somebody took a good care of seg while I was away for the last 20
years, and I am extremely grateful for that. I have been using it. In the
meantime, things have changed and now I am almost clueless about how you
deal with contributed modules and what steps I should take once I get it to
work. Also, is it a good idea to clone and fix seg for this purpose, or is
there a more appropriate template? Or maybe just building it from scratch
will be a better idea?

I have seen a lot of bit rot in other extensions (never contributed) that I
have not maintained since 2009 and I now I am unable to fix some of them,
so I wonder how much of old knowledge is still applicable. In other words,
is what I see in new code just a change of macros or the change of
principles?

Thanks,

--Gene


Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 15, 2017, at 4:50 PM, Michael Paquier  
> wrote:
> If you wish to fix seg in some way, you could always
> patch them. But I am not sure what you are trying to fix, so more
> details would be welcome.

I was contemplating how much functionality I could borrow from seg to build 
another interval type and what unforeseen hurdles would emerge while I was 
digging into it. It turned out to be less straightforward than I thought. 

>> I have seen a lot of bit rot in other extensions (never contributed) that I
>> have not maintained since 2009 and I now I am unable to fix some of them, so
>> I wonder how much of old knowledge is still applicable. In other words, is
>> what I see in new code just a change of macros or the change of principles?
> 
> APIs in Postgres are usually stable. You should be able to update your
> own extensions. If you want to discuss about a couple of things in
> particular, don't hesitate!

Thank you Michael. I will summarize the problems I have already encountered in 
a later reply to this thread.

I do find the API to be unchanged, but I get the sense that some macros are 
new. Maybe it’s just my bad memory. Overall, I am pleased with a much better 
automation of extension building and testing.

—Gene




Re: genomic locus

2017-12-21 Thread Gene Selkov
Nice work, Andrew!

I wish I knew about it earlier.

> On Dec 16, 2017, at 8:53 AM, Andrew Dunstan  
> wrote:
> 
> I keep most of the out-of-tree extensions I maintain green by building
> and testing them in a buildfarm member. That way I become aware pretty
> quickly if any API change has broken them, as happened just the other
> day in fact. To do this requires writing a small perl paqckage. There
> are three examples in the buildfarm client sources at
> https://github.com/PGBuildFarm/client-code/tree/master/PGBuild/Modules
> and one of these is included in the buildfarm client releases.



Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 17, 2017, at 7:57 PM, Robert Haas  wrote:
> 
> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov  wrote:
>> I need a data type to represent genomic positions, which will consist of a
>> string and a pair of integers with interval logic and access methods. Sort
>> of like my seg type, but more straightforward.
> 
> Have you thought about just using a composite type?

Yes, I have. That is sort of what I have been doing; a composite type certainly 
gets the job done but I don’t feel it reduces query complexity, at least from 
the user’s point of view. Maybe I don’t know enough.

Here’s an example of how I imagine a composite genomic locus (conventionally 
represented as text ‘:’ integer ‘-‘ integer):

CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab

Where test_locus.tab has stuff like:

(chr3,"[178916937,178916940]")  GAA CHP2_PIK3CA_2
(chr3,"[178916939,178916948]")  AGGAT   CHP2_PIK3CA_2
(chr3,"[178916940,178916941]")  G   A   CHP2_PIK3CA_2
(chr3,"[178916943,178916944]")  A   G   CHP2_PIK3CA_2
(chr3,"[178916943,178916946]")  AAG CHP2_PIK3CA_2
(chr3,"[178916943,178916952]")  AAGATCCTC   CHP2_PIK3CA_2
(chr3,"[178916944,178916945]")  A   G   CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G   C   CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G   T   CHP2_PIK3CA_2
(chr3,"[178916945,178916948]")  GAT CHP2_PIK3CA_2

When the table is loaded, I can pull the subset shown above with this query:

SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
'[178916937, 178916948]’;
  pos   |ref| alt |  id
+---+-+---
 (chr3,"[178916937,178916941)") | GAA   | | CHP2_PIK3CA_2
 (chr3,"[178916939,178916949)") | AGGAT | | CHP2_PIK3CA_2
 . . . .

So far so good. It gets the job done. However, it is only a small step towards 
a fully encapsulated, monolithic type I want it to be. The above query It is 
marginally better than its atomic-type equivalent:

SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
least(stop, 178916948);
 contig |   start   |   stop|ref| alt |  id
+---+---+---+-+---
 chr3   | 178916937 | 178916940 | GAA   | | CHP2_PIK3CA_2
 chr3   | 178916939 | 178916948 | AGGAT | | CHP2_PIK3CA_2
 . . . .

and it requires addition syntax transformations steps to go from conventional 
locus representation 'chr3:178916937-178916940' to composite 
'(chr3,"[178916937,178916940]”)’ and back.

Of course, the relative benefits of partial encapsulation I achieve by bundling 
text with int4range accumulate, compared to (text, int4, int4), as queries grow 
more complex. But because the elements of a composite type still require a 
separate query term for each of them (unless there is some magic I am not aware 
of), the complexity of a typical query I need to run exceeds my feeble 
sight-reading capacity. I want things that are conceptually simple to be 
expressed in simple terms, if possible.

Like so:

CREATE EXTENSION locus;
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
\copy test_locus from data/oncomine.hotspot.tab

SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
   pos|ref| alt |  id
--+---+-+---
 chr3:178916937-178916940 | GAA   | | CHP2_PIK3CA_2
 chr3:178916939-178916948 | AGGAT | | CHP2_PIK3CA_2
 chr3:178916940-178916941 | G | A   | CHP2_PIK3CA_2
 chr3:178916943-178916944 | A | G   | CHP2_PIK3CA_2
 chr3:178916943-178916946 | AAG   | | CHP2_PIK3CA_2
 chr3:178916943-178916952 | AAGATCCTC | | CHP2_PIK3CA_2
 chr3:178916944-178916945 | A | G   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G | C   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G | T   | CHP2_PIK3CA_2
 chr3:178916945-178916948 | GAT   | | CHP2_PIK3CA_2
(10 rows)

I have encountered some pesky geometry / indexing problems while building this 
extension (https://github.com/selkovjr/locus 
<https://github.com/selkovjr/locus>), but I hope I can solve them at least at 
the level afforded by the composite type, while keeping the clean interface of 
a monolithic type. I understand I could probably achieve the same cleanliness 
by defining functions and operators over the complex type, but by the time I’m 
done with that, will I have coded about the same amount of stuff as required to 
build an extended type?


Regards,

—Gene






Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 18, 2017, at 5:00 AM, PostgreSQL - Hans-Jürgen Schönig 
>  wrote:
> 
> maybe this one is also helpful: 
> https://wiki.postgresql.org/images/1/1b/Postbis_pgcon_eu_2012.pdf 
> 
> it seems they have put a lot of work into that.

Yes, that is a curious piece of code and I will give it a test, although it is 
not related to the problem at hand. It is about representing sequences, while I 
am struggling to create a convenient representation of loci.

I liked the slide show you linked. It touched on a pertinent subject (although 
again, not related to loci). Just last week, I attempted to help a relative who 
works on a microbiome project to sort through the guidelines about the encoding 
of altitude, elevation, and depth thought up by somebody at EBI without much 
regard for their usability. We haven’t had much success in sorting that out 
because the guidelines were obviously written for other people to follow; their 
practical application resulted in total confusion.

For example, is this definition redundant or contradictory:

"The altitude of the sample is the vertical distance between Earth's surface 
above sea level and the sampled position in the air”

Is that AGL or AMSL? Why not use a dictionary definition? Because of stuff like 
that, I treat all products by biologists with suspicion, until proven workable 
— especially stuff developed by consortia of multiple institutions.

Anyway, thanks for the link.

—Gene

Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 18, 2017, at 6:59 AM, Craig Ringer  wrote:
> 
> If you think it'd make logical sense to extend seg with a string descriptor 
> of some sort and could come up with a name/use case that's not quite so 
> narrowly focused as genetics alone, then I could see adding it as a secondary 
> type in the same extension.
> 
> But it's more likely that the best course would be to extract the seg 
> extension from core, rename it, hack it as desired, and build it as an 
> extension maintained out-of-tree. 

That is exactly what I’ve been doing for a few days now, and the process is 
testing my sanity.

Attaching a string to an interval seems like an easy enough undertaking, and I 
have got it to work at the UI level, at least. The queries I wanted to be able 
to make against it run without problems and produce the desired results. Here 
is my first attempt: https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making and there are issues 
around indexing that I am not sure how to solve.

The main problem is that attaching a tag to an interval makes it incommensurate 
with intervals having a different tag. That makes them hard to index with an 
access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:1-2’? Intuitively, 
it should be NULL, however, I am not sure the method allows for that; it was 
developed for objects living in the same metric space. I have mechanistically 
reproduce the indexing methods of seg, but the resulting index is broken. All 
queries against an indexed table return a null result.

Problem 2. While the intersection (overlap, &c.) of any two loci produces 
obvious results, non-intersection does not. When I query for all loci not 
overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 
1. I don’t want the query to return anything from other contigs, because it is 
obvious that features on different contigs do not overlap. I may be able to fix 
that by making separate functions for non-overlaps and adding a constraint to 
them, but that seems like a kludge.

Problem 3 (alternative to 1). I realize that any clustering can help build an 
efficient index, no matter how bizarre. So I could, for example, ignore the 
contigs altogether and build a single index tree, using only position 
co-ordinates and pretending that all positions are on the same contig; the 
question then is whether and how such lossy index will affect the ordering of 
query results. Can I use a separate function for ordering? I have yet to make 
an experiment. Not that this would be equivalent to indexing the attributes of 
a composite type separately (if I understood it correctly).

An alternative to neglecting the contig element might be to use it as a second 
dimension. Expressed that way, a union of several loci might consist of a set 
of contig names attached to the bounding interval. Not sure whether that makes 
any sense; in the first approximation, I imagine something equivalent to 
storing each contig’s data in a separate table with a separate index, except 
derived from a single actual database table, but I have no clue for how to go 
about doing that.


Thanks,

—Gene




Re: genomic locus

2017-12-21 Thread Gene Selkov
Hi Oleg,

Great to hear from you. I wondered how many of the old-timers were still around.

> Why not use composite type ? For simple interval approach it's worked for us
> (see attached hdate.sql).

I have just begun looking at your hdate example; I see potentially useful stuff 
in it, but the first thing that I noticed is hat it is not fully equivalent to 
my problem. It looks like you only need to match intervals, while I need to 
match intervals and something else — ideally, in a single operation. I 
attempted to explain that in my reply to Craig Ringer.

> If you need to specify distribution
> function,

Not in this case; there is no uncertainty associated with the loci; where there 
is uncertainty is in the existence of a feature called at a locus: is it real 
or is it a technogenic artifact? But that is a different problem for a later 
day.

> than it may be
> worth to see orion project http://orion.cs.purdue.edu/index.html
> 6 years ago we was thinking about implementation special UNCERTAINTY data type
> (http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but 
> never
> started :( It'd be nice if you start this very interesting for science 
> project.

I love uncertainty, and I’ve always wished I could make it computable. I also 
wish folks around me had the same appreciation for it. My job is to say yes or 
no where the data suggest maybe, or maybe not. Needless to say, I feel a bit 
exercised.

I am reading the info you provided with keen interest.

> btw, now you can use range data type, check
> https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf 
> 

Great stuff, I was not aware of it. I saw it in early development but did not 
know it made it to the core. I tried it (and will go and update a few kludgy 
apps where I had to use bad surrogates). It is not directly applicable to 
genomic loci because it will require additional constraints for intelligent 
matching. I want to go for compete encapsulation of constraints.

Part of the reason for such a perverse desire is that I use the database as a 
calculator — that is, I load some data in a one-off experiment and I literally 
type everything in psql while I muddle through. There is a limit on how much I 
can type and not screw things up beyond comprehension, so I want the query 
language to be as easy and interactive as possible. Having to drag along a set 
of additional constraints is not quite interactive and is error-prone.


Regards,

—Gene

Re: genomic locus

2017-12-22 Thread Gene Selkov

> On Dec 22, 2017, at 1:53 AM, Teodor Sigaev  wrote:
> 
> Hmm, would you try to implement separate type for querying? Similar to 
> tsquery, lquery (for ltree), jsquery etc.

That sounds like a good idea if I want to make an app that will only be 
accessed through a purpose-built front end. Now I’m messing with lots of data, 
making numerous small one-off experiments. Since all of that stuff consumes my 
brain power and keystrokes, I want to minimize all the ancillary stuff or at 
least make it invisible. But I’ll probably go ahead and add a separate 
query-friendly type  that if nothing else helps.

I think I can wrangle this type into GiST just by tweaking consistent(), 
union(), and picksplit(), if I manage to express my needs in C without breaking 
too many things. My first attempt segfaulted.

Here’s my plan of attack. I think that by setting a union of inconsistent loci 
(those on different contigs) to [0, MAX_INT], I will expose such union to a 
huge penalty, even without having to do anything with the current penalty(). No 
query will go there.

The required change to consistent() is obvious: contigs do not match — go away.

The situation with picksplit() may be more tricky; I can’t imagine all possible 
consequences until I’ve seen how it works. Currently, with simple intervals, it 
sorts them by center points and splits the sorted list in half. I have already 
changed the internal sort function, picksplit_item_cmp(), to make sure the data 
are sorted on the contig first, then by center point (or any other geometric 
feature). I am thinking about splitting the list at the first inconsistent 
contig, sending the consistent first part with its well-defined geometry to the 
left page, and filling the right page with whatever remains. If the right page 
has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should 
be again picked for splitting at the next iteration (if I understand the 
algorithm correctly).

If all goes to plan, I will end up with an index tree partitioned by contig at 
the top level and geometrically down from there. That will be as close as I can 
get to an array of config-specific indices, without having to store data in 
separate tables.

What do you think of that?



I have a low-level technical question. Because I can’t anticipate the maximum 
length of contig names (and do not want to waste space), I have made the new 
locus type a varlena, like this:

#include "utils/varlena.h"

typedef struct LOCUS
{
  int32 l_len_; /* varlena header (do not touch directly!) */
  int32 start;
  int32 end;
  char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

That flexible array member messes with me every time I need to copy it while 
deriving a new locus object from an existing one (or from a pair). What I ended 
up doing is this:

  LOCUS  *l = PG_GETARG_LOCUS_P(0);
  LOCUS  *new_locus;
  char   *contig;
  intsize;

  new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
  contig = pstrdup(l->contig); // need this to determine the length of contig 
name at runtime
  size = LOCUS_SIZE(contig);
  SET_VARSIZE(new_locus, size);
  strcpy(new_locus->contig, contig);

Is there a more direct way to clone a varlena structure (possibly assigning an 
differently-sized contig to it)? One that is also memory-safe?


Thank you,

—Gene


> Gene Selkov wrote:
>>> On Dec 17, 2017, at 7:57 PM, Robert Haas >> <mailto:robertmh...@gmail.com> <mailto:robertmh...@gmail.com 
>>> <mailto:robertmh...@gmail.com>>> wrote:
>>> 
>>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov >> <mailto:selko...@gmail.com> <mailto:selko...@gmail.com 
>>> <mailto:selko...@gmail.com>>> wrote:
>>>> I need a data type to represent genomic positions, which will consist of a
>>>> string and a pair of integers with interval logic and access methods. Sort
>>>> of like my seg type, but more straightforward.
>>> 
>>> Have you thought about just using a composite type?
>> Yes, I have. That is sort of what I have been doing; a composite type 
>> certainly gets the job done but I don’t feel it reduces query complexity, at 
>> least from the user’s point of view. Maybe I don’t know enough.
>> Here’s an example of how I imagine a composite genomic locus (conventionally 
>> represented as text ‘:’ integer ‘-‘ integer):
>> CREATE TYPE locus AS (contig text, coord int4range);
>> CREATE TABLE test_locus (
>>   pos locus,
>>   ref text,
>>   alt text,
>>   id text
>> );
>> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
>> \copy test_locus from test_locus.tab
>> Where test_locus.tab has stuff like:
>> (chr3,"[178916937,178916940]&q