> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 
> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selko...@gmail.com> wrote:
>> I need a data type to represent genomic positions, which will consist of a
>> string and a pair of integers with interval logic and access methods. Sort
>> of like my seg type, but more straightforward.
> 
> Have you thought about just using a composite type?

Yes, I have. That is sort of what I have been doing; a composite type certainly 
gets the job done but I don’t feel it reduces query complexity, at least from 
the user’s point of view. Maybe I don’t know enough.

Here’s an example of how I imagine a composite genomic locus (conventionally 
represented as text ‘:’ integer ‘-‘ integer):

CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab

Where test_locus.tab has stuff like:

(chr3,"[178916937,178916940]")  GAA             CHP2_PIK3CA_2
(chr3,"[178916939,178916948]")  AGAAAAGAT               CHP2_PIK3CA_2
(chr3,"[178916940,178916941]")  G       A       CHP2_PIK3CA_2
(chr3,"[178916943,178916944]")  A       G       CHP2_PIK3CA_2
(chr3,"[178916943,178916946]")  AAG             CHP2_PIK3CA_2
(chr3,"[178916943,178916952]")  AAGATCCTC               CHP2_PIK3CA_2
(chr3,"[178916944,178916945]")  A       G       CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G       C       CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G       T       CHP2_PIK3CA_2
(chr3,"[178916945,178916948]")  GAT             CHP2_PIK3CA_2

When the table is loaded, I can pull the subset shown above with this query:

SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
'[178916937, 178916948]’;
              pos               |    ref    | alt |      id
--------------------------------+-----------+-----+---------------
 (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
 (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .

So far so good. It gets the job done. However, it is only a small step towards 
a fully encapsulated, monolithic type I want it to be. The above query It is 
marginally better than its atomic-type equivalent:

SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
least(stop, 178916948);
 contig |   start   |   stop    |    ref    | alt |      id
--------+-----------+-----------+-----------+-----+---------------
 chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .

and it requires addition syntax transformations steps to go from conventional 
locus representation 'chr3:178916937-178916940' to composite 
'(chr3,"[178916937,178916940]”)’ and back.

Of course, the relative benefits of partial encapsulation I achieve by bundling 
text with int4range accumulate, compared to (text, int4, int4), as queries grow 
more complex. But because the elements of a composite type still require a 
separate query term for each of them (unless there is some magic I am not aware 
of), the complexity of a typical query I need to run exceeds my feeble 
sight-reading capacity. I want things that are conceptually simple to be 
expressed in simple terms, if possible.

Like so:

CREATE EXTENSION locus;
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
\copy test_locus from data/oncomine.hotspot.tab

SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
           pos            |    ref    | alt |      id
--------------------------+-----------+-----+---------------
 chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
 chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
 chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
 chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
 chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
 chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
(10 rows)

I have encountered some pesky geometry / indexing problems while building this 
extension (https://github.com/selkovjr/locus 
<https://github.com/selkovjr/locus>), but I hope I can solve them at least at 
the level afforded by the composite type, while keeping the clean interface of 
a monolithic type. I understand I could probably achieve the same cleanliness 
by defining functions and operators over the complex type, but by the time I’m 
done with that, will I have coded about the same amount of stuff as required to 
build an extended type?


Regards,

—Gene




Reply via email to