Re: [DAS] 1.6 draft 7

Mitch Skinner Wed, 29 Sep 2010 02:38:20 -0700

 On 09/23/2010 07:44 AM, Thomas Down wrote:

Just getting in touch to let you know that we're very interested in getting
big piles of sequencing data onto genome browsers, and are certainly
interested in alternative formats.  My current thoughts are that -- at least
for what we're doing -- a slightly more concise XML schema, plus richer
control of server-side summarization/binning (think maxbins++) might be the
sweet spot, but very interested to see other options as well.


Hello,

I've only just joined the list, so I think I'm entering this conversation mid-way, and I apologize if I'm speaking out of turn. I do have a few opinions about putting genomic data into JSON, though, and it looks like this might be the time to express them, if you're discussing DAS alternative content formats.

As you may know, JBrowse client-server communication is (currently) not DAS. And that's on purpose, for multiple reasons that may be relevant to this discussion. They are:


1. XML verbosity
2. cacheability
3. need for server-side code

JBrowse does something different from DAS in each of those areas. In order of decreasing ease-of-adoption, they are:


1. XML verbosity:

It's true that if you gzip then this doesn't cost you too much in terms of size-on-the-wire, but you still have to deal with it in memory on the client and server.

Most of the genomics data formats I've seen have attributes that are either *named* or *positional*. Positional data formats, like GFF2, specify in advance the attributes (columns) that each record (line) can have. This saves space, but makes the format awkward to extend (e.g., what happened when different people extended GFF in different ways). With XML, by contrast, each attribute is named. That makes the tradeoff the other way around; it takes up more space, but it's easier to extend, because new attributes get new names and don't step on each other.

And then there are hybrids like GFF3, where some attributes are specified positionally, and the rest are named.

But there's another middle ground--you can have positional attributes, but have the *positions* be named. This is what JBrowse does; each feature is represented in JSON as an array. Each position in the array is named, in a separate array of strings. In other words, there are a bunch of genomic features like this:


[
[10000,20000,-1,"foo"],
[50000,80000,1,"bar"]
]

and a separate array (not sure what to call it, a "schema" array, maybe?) that names each position, e.g.:


["start", "end", "strand", "name"]

or what have you. And, since it's JSON, attribute values can have their own nested structure; for example, one of the attributes could be "subfeatures" which could be an array of subfeatures. In JBrowse, each track can have its own "schema"; attributes like "phase", "score", and "subfeatures" can be left out of a track if it doesn't use them.
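To make the pairing of the "schema" array with the feature arrays concrete, here is a minimal sketch in Python. It only illustrates the idea and is not JBrowse's actual client code; the function name to_records is made up for this example.

```python
import json

# Schema array and feature arrays taken from the example in this message.
schema = ["start", "end", "strand", "name"]
features = json.loads('[[10000,20000,-1,"foo"],[50000,80000,1,"bar"]]')

def to_records(schema, features):
    """Pair each positional value with the name at the same index in the schema."""
    return [dict(zip(schema, feature)) for feature in features]

records = to_records(schema, features)
print(records[0]["name"], records[1]["strand"])
```

A client would normally index into the arrays directly rather than build dictionaries; the conversion here is only to show that nothing is lost by dropping the per-feature names.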

For genomic applications, where the space used by the schema array can be amortized over large numbers of features, I think this is a pretty substantial win. And if you need ad-hoc additional attributes, you can always pull the GFF3 trick and have a set of named attributes at the end.
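A rough way to see the amortization is to serialize the same features both ways and compare lengths. The sketch below uses synthetic features invented for illustration, so the sizes it prints say nothing about real tracks; it measures uncompressed JSON, which is the in-memory cost rather than the gzipped size on the wire.

```python
import json

schema = ["start", "end", "strand", "name"]
# Synthetic features, made up purely for this comparison.
features = [[i * 1000, i * 1000 + 500, 1, f"feat{i}"] for i in range(1000)]

# Positional encoding: the schema is stored once for the whole track.
positional = json.dumps({"schema": schema, "features": features},
                        separators=(",", ":"))
# Named encoding: every feature repeats every attribute name.
named = json.dumps([dict(zip(schema, f)) for f in features],
                   separators=(",", ":"))

print(len(positional), len(named))
```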


2. Cacheability

JBrowse partitions feature data by refseq, track, and genomic region. And the boundaries between those chunks of data are defined statically. That makes those chunks of data more cacheable; if you're viewing a given region, and you move a little to the left or to the right, you usually don't need to make a new HTTP request. And if you come back to the same region the next day, your browser will likely have the same data in its cache.
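The effect of static boundaries on caching can be sketched as follows. Fixed-width chunks, the CHUNK_SIZE value, and the URL layout are all assumptions made for this example and are not necessarily how JBrowse draws its boundaries; the point is only that the set of URLs depends on the chunk grid, not on the exact view.

```python
CHUNK_SIZE = 100_000  # hypothetical fixed chunk width, in bases

def chunk_indices(start, end, chunk_size=CHUNK_SIZE):
    """Return indices of the fixed chunks overlapping the half-open range [start, end)."""
    if end <= start:
        return []
    return list(range(start // chunk_size, (end - 1) // chunk_size + 1))

def chunk_url(refseq, track, index):
    """Build a stable URL for one chunk (hypothetical layout)."""
    return f"data/{refseq}/{track}/chunk-{index}.json"

# Panning slightly within the same chunk requests the same URL,
# so the browser cache can answer without a new HTTP request.
before = [chunk_url("chr1", "genes", i) for i in chunk_indices(120_000, 180_000)]
after = [chunk_url("chr1", "genes", i) for i in chunk_indices(130_000, 190_000)]
print(before == after)
```

With a dynamic range query, by contrast, the two views would produce two different request URLs, and neither response could be reused for the other.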

You could, potentially, provide useful caching-related HTTP headers from current DAS servers, but my understanding is that implementations typically don't. And I think that has something to do with dynamic queries being harder to usefully cache.

Gregg has been working on proxying DAS sources for JBrowse clients; I asked him if his proxy would provide useful caching headers, and he said, "well, I could make them up" :) But it would be nice if we could get them without having to do that.


3. Server-side code

The fact that the boundaries of the data chunks are statically defined means that you can just pre-generate them and serve them with a plain static HTTP server. Sometimes I meet people who disagree with this choice, but if you have a workload that's dominated by reads then I think it's a pretty clear win, especially if you have a lot of users. Plus, the HTTP server will generate appropriate caching-related headers for free. And it's easier for less-technical users, or users with limited rights on a server, to set up a JBrowse instance if they don't need to set up CGI/servlets/whatever.

BAM and BigBed have made a similar choice to be statically-serveable; you could view the JBrowse approach as doing something like what they're doing, in a way that's easier to digest for web browser clients. In the JBrowse case, range queries are done by the client using a lazily-loaded nested containment list, described here:


http://biowiki.org/view/JBrowse/LazyFeatureLoading
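For readers who haven't met nested containment lists, here is a small in-memory sketch of the general technique in Python. It leaves out the lazy loading entirely (every sublist is already in memory), and the function names and node layout are invented for this example rather than taken from JBrowse.

```python
from bisect import bisect_right

def build_nclist(intervals):
    """Build a nested containment list from (start, end) tuples.

    Each node is [start, end, sublist]. An interval fully contained in an
    earlier one is pushed into that interval's sublist, so within any one
    list both starts and ends are in increasing order.
    """
    top = []
    stack = []  # chain of enclosing nodes that are still open
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = [start, end, []]
        # Starts are sorted, so containment only depends on the end.
        while stack and end > stack[-1][1]:
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top

def query(nclist, qstart, qend):
    """Return all intervals overlapping the half-open range [qstart, qend)."""
    hits = []
    # Rebuilt on every call for brevity; a real index would store this.
    ends = [node[1] for node in nclist]
    i = bisect_right(ends, qstart)  # first sibling whose end is past qstart
    while i < len(nclist) and nclist[i][0] < qend:
        start, end, sublist = nclist[i]
        hits.append((start, end))
        # A contained interval can only overlap if its parent does.
        hits.extend(query(sublist, qstart, qend))
        i += 1
    return hits

nclist = build_nclist([(10000, 20000), (12000, 15000), (50000, 80000),
                       (60000, 70000), (61000, 62000)])
print(query(nclist, 16000, 19000))
```

In a lazily-loaded variant, a sublist would be replaced by a reference to a separate static file that the client fetches only when the enclosing interval overlaps the query.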

It's possible to store the JSON gzipped on the server and then send it out as-is; I compress the json files and give them a .jsonz extension, and then add this to my apache config:


<Files *.jsonz>
  ForceType text/javascript
  Header set Content-Encoding: gzip
</Files>

which works in all the web browsers that JBrowse supports, including IE6/7/8.
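Producing the .jsonz files is just a matter of gzipping the JSON when it's generated. A minimal sketch in Python, using only the standard library; the helper name write_jsonz and the file name are made up for this example, and this is not the actual JBrowse generation script.

```python
import gzip
import json
import os
import tempfile

def write_jsonz(data, path):
    """Serialize data as JSON and store it gzip-compressed at path."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, separators=(",", ":"))

features = [[10000, 20000, -1, "foo"], [50000, 80000, 1, "bar"]]
path = os.path.join(tempfile.mkdtemp(), "features.jsonz")  # hypothetical name
write_jsonz(features, path)

# Decompressing on read stands in for what the web browser does
# when the response carries a Content-Encoding of gzip.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    roundtrip = json.load(handle)
print(roundtrip == features)
```

Because the file on disk is already compressed, the server does no per-request compression work; it only has to label the response correctly.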


I hope this contributes to the discussion,
Mitch
