Re: [DAS] 1.6 draft 7

Andy Jenkinson Wed, 29 Sep 2010 08:58:35 -0700

On 29 Sep 2010, at 10:22, Mitch Skinner wrote:

> On 09/23/2010 07:44 AM, Thomas Down wrote:
>> Just getting in touch to let you know that we're very interested in getting
>> big piles of sequencing data onto genome browsers, and are certainly
>> interested in alternative formats.  My current thoughts are that -- at least
>> for what we're doing -- a slightly more concise XML schema, plus richer
>> control of server-side summarization/binning (think maxbins++) might be the
>> sweet spot, but very interested to see other options as well.
> 
> Hello,
> 
> I've only just joined the list, so I think I'm entering this conversation 
> mid-way, and I apologize if I'm speaking out of turn.  I do have a few 
> opinions about putting genomic data into JSON, though, and it looks like this 
> might be the time to express them, if you're discussing DAS alternative 
> content formats.
> 
> As you may know, JBrowse client-server communication is (currently) not DAS.  
> And that's on purpose, for multiple reasons that may be relevant to this 
> discussion.  They are:
> 
> 1. XML verbosity
> 2. cacheability
> 3. need for server-side code


Hi Mitch,

Thanks for the comments. It's very useful to see the reasons for your design 
choices, and no doubt we can learn from them when we adopt JSON as an 
alternative format.

Warning: geek mode on

I found all points interesting, but one in particular immediately piqued my 
curiosity: indexed-vs-keyed. Your suggestion certainly makes sense in that 
keyed JSON files are usually going to be bigger. However, I think it's worth 
considering to what degree the benefit of adopting your approach is achievable 
in "real life". In particular there are two factors likely to reduce the 
difference:
1. compression (the keys are repetitive and so theoretically ideal to compress)
2. objects with missing fields (hashes can omit a key if there is no value, but 
arrays must always include an empty string or null value). It is common in DAS 
for objects of different types to not have the same data model.

For 2, consider this "hash" example:

[
 {
  "id" : "feature1",
  "note" : "foo",
  "ori" : "+"
 },
 {
  "id" : "feature2"
 },
 ...
]

An "array" version of the same thing:

[
 [
  "id",
  "note",
  "ori"
 ],
 [
  "feature1",
  "foo",
  "+"
 ],
 [
  "feature2",
  null,
  null
 ],
 ...
]

Ignoring the minimal overhead of the "header row", the two versions could 
conceivably end up reasonably similar in size, and the array version could 
possibly even be the bigger one.

In fact I did a little test: I randomly generated 100,000 features in a JSON 
format with 5 small string fields (indexed file size ~5 MB). It turns out that 
when uncompressed the keyed file is indeed much bigger (87.5% bigger than the 
indexed file). After compression the difference went down to 25.8%, though. 
When 10% of fields were empty (i.e. empty strings for the indexed style, and 
omitted pairs for the hashed style), it went down again to 16.8%. This last 
bit surprised me, to be honest: I expected a much more modest effect. For 10% 
of fields in a dataset to be empty seems a reasonable expectation to me too, 
so if your data model has a variable "occupancy" of fields across rows, I 
would suggest this is worth considering. In particular, using 'null' instead 
of empty strings would have an even bigger effect (I don't know what JBrowse 
does in such circumstances).
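
For anyone who wants to try this themselves, here is a minimal Python sketch of 
how such a comparison could be set up. It is not the script behind the numbers 
above: the field names, value lengths, feature count and seed are all made up 
for illustration, so expect different percentages.

```python
import gzip
import json
import random
import string


def random_value(rng, empty_fraction):
    """Return a short random string, or None to mark an empty field."""
    if rng.random() < empty_fraction:
        return None
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(6))


def to_bytes(obj):
    """Serialise compactly (no extra whitespace) so only the structure differs."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def build_payloads(n_features, fields, empty_fraction, seed=0):
    """Serialise the same random features as keyed and as indexed JSON."""
    rng = random.Random(seed)
    rows = [[random_value(rng, empty_fraction) for _ in fields]
            for _ in range(n_features)]
    # Keyed style: one object per feature, empty fields are simply omitted.
    keyed = [{k: v for k, v in zip(fields, row) if v is not None}
             for row in rows]
    # Indexed style: a header row, then one array per feature with "" for
    # every empty field.
    indexed = [list(fields)] + [["" if v is None else v for v in row]
                                for row in rows]
    return to_bytes(keyed), to_bytes(indexed)


def overhead_percent(keyed, indexed):
    """How much bigger the keyed payload is, as a percentage of the indexed one."""
    return 100.0 * (len(keyed) - len(indexed)) / len(indexed)


def compare(n_features=10000, empty_fraction=0.1):
    """Return the keyed-vs-indexed overhead before and after gzip."""
    fields = ["id", "type", "note", "ori", "label"]  # hypothetical field names
    keyed, indexed = build_payloads(n_features, fields, empty_fraction)
    return {
        "raw": overhead_percent(keyed, indexed),
        "gzip": overhead_percent(gzip.compress(keyed), gzip.compress(indexed)),
    }


if __name__ == "__main__":
    for fraction in (0.0, 0.1):
        print(fraction, compare(empty_fraction=fraction))
```

Swapping the "" in the indexed branch for None would give the 'null' variant 
mentioned above.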

Regarding memory, if you think about it there is nothing to stop you using 
arrays in code regardless of the file format, should memory be an issue.
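
As a rough illustration of that point, a client could read the keyed form off 
the wire and immediately repack it into a schema plus plain arrays. The 
function name below is hypothetical, and this is only one way of doing it:

```python
import json


def keyed_to_rows(keyed_json):
    """Parse keyed JSON features into (schema, rows), filling gaps with None."""
    features = json.loads(keyed_json)
    schema = []
    for feature in features:
        for key in feature:
            if key not in schema:
                schema.append(key)  # keep first-seen order of field names
    rows = [[feature.get(key) for key in schema] for feature in features]
    return schema, rows


text = '[{"id": "feature1", "note": "foo", "ori": "+"}, {"id": "feature2"}]'
schema, rows = keyed_to_rows(text)
```

Here schema ends up holding the three field names once, and each feature 
becomes a three-element list with None where the key was omitted.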

None of this is intended to suggest that your model is anything but optimal for 
JBrowse, and indeed even 10% of a 1 MB file is still a second or two, and 
worth saving for sure.

But for the benefit of the rest of the DAS list: if we use this approach for a 
JSON version of the DAS XML format, we need to think carefully about it. We 
have a format with quite a lot of potential fields, many of which are commonly 
unused. A hunch tells me that, despite this, most datasets are uniform enough 
to make it worth going the JBrowse way, but it'd be nice to see some other 
thoughts on the matter.

/geek

Cheers,
Andy

> 
> JBrowse does something different from DAS in each of those areas.  In order 
> of decreasing ease-of-adoption, they are:
> 
> 1. XML verbosity:
> 
> It's true that if you gzip then this doesn't cost you too much in terms of 
> size-on-the-wire, but you still have to deal with it in memory on the client 
> and server.
> 
> Most of the genomics data formats I've seen have attributes that are either 
> *named* or *positional*.  Positional data formats, like GFF2, specify in 
> advance the attributes (columns) that each record (line) can have.  This 
> saves space, but makes the format awkward to extend (e.g., what happened when 
> different people extended GFF in different ways).  With XML, by contrast, 
> each attribute is named.  That makes the tradeoff the other way around; it 
> takes up more space, but it's easier to extend, because new attributes get 
> new names and don't step on each other.
> 
> And then there are hybrids like GFF3, where some attributes are specified 
> positionally, and the rest are named.
> 
> But there's another middle ground--you can have positional attributes, but 
> have the *positions* be named.  This is what JBrowse does; each feature is 
> represented in JSON as an array.  Each position in the array is named, in a 
> separate array of strings.  In other words, there are a bunch of genomic 
> features like this:
> 
> [
> [10000,20000,-1,"foo"],
> [50000,80000,1,"bar"]
> ]
> 
> and a separate array (not sure what to call it, a "schema" array, maybe?) 
> that names each position, e.g.:
> 
> ["start", "end", "strand", "name"]
> 
> or what have you.  And, since it's JSON, attribute values can have their own 
> nested structure; for example, one of the attributes could be "subfeatures" 
> which could be an array of subfeatures.  In JBrowse, each track can have its 
> own "schema"; attributes like "phase", "score", and "subfeatures" can be left 
> out of a track if it doesn't use them.
> 
> For genomic applications, where the space used by the schema array can be 
> amortized over large numbers of features, I think this is a pretty 
> substantial win.  And if you need ad-hoc additional attributes, you can 
> always pull the GFF3 trick and have a set of named attributes at the end.
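
To make the named-positions idea concrete, here is a small Python sketch of 
how a client might read such features. The helper names are hypothetical, and 
treating a trailing object as the bag of ad-hoc named attributes is an 
assumption made for illustration, not something taken from JBrowse:

```python
import json

# The "schema" array and feature arrays from the example above.
SCHEMA = ["start", "end", "strand", "name"]
features = json.loads('[[10000,20000,-1,"foo"],[50000,80000,1,"bar"]]')


def field(feature, name, schema=SCHEMA):
    """Look up a named attribute in a positional feature array."""
    return feature[schema.index(name)]


def as_dict(feature, schema=SCHEMA):
    """Expand a feature to a dict, merging in any trailing named attributes."""
    record = dict(zip(schema, feature))
    # Assumed convention: one extra element, if it is an object, carries
    # the ad-hoc named attributes (the "GFF3 trick").
    if len(feature) > len(schema) and isinstance(feature[-1], dict):
        record.update(feature[-1])
    return record
```

With this, field(features[0], "name") picks out "foo" without the key ever 
being repeated in the data.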
> 
> 2. Cacheability
> 
> JBrowse partitions feature data by refseq, track, and genomic region.  And 
> the boundaries between those chunks of data are defined statically.  That 
> makes those chunks of data more cacheable; if you're viewing a given region, 
> and you move a little to the left or to the right, you usually don't need to 
> make a new HTTP request.  And if you come back to the same region the next 
> day, your browser will likely have the same data in its cache.
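
The caching benefit of static boundaries can be sketched as follows. This 
assumes fixed-width chunks and a made-up URL layout, neither of which is 
claimed to be what JBrowse actually does; the point is only that nearby views 
map to the same URLs:

```python
CHUNK_SIZE = 100_000  # hypothetical fixed chunk width in base pairs


def chunk_urls(refseq, track, start, end, chunk_size=CHUNK_SIZE):
    """List the static chunk URLs overlapping the half-open region [start, end).

    Assumes 0 <= start < end. The data/<refseq>/<track>/<n>.json layout is
    invented for this example.
    """
    first = start // chunk_size
    last = (end - 1) // chunk_size
    return [f"data/{refseq}/{track}/{i}.json" for i in range(first, last + 1)]
```

Viewing 120000-180000 and then scrolling to 130000-190000 asks for the same 
single URL both times, so the second view can be served from the browser cache.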
> 
> You could, potentially, provide useful caching-related HTTP headers from 
> current DAS servers, but my understanding is that implementations typically 
> don't.  And I think that has something to do with dynamic queries being 
> harder to usefully cache.
> 
> Gregg has been working on proxying DAS sources for JBrowse clients; I asked 
> him if his proxy would provide useful caching headers, and he said, "well, I 
> could make them up" :)  But it would be nice if we could get them without 
> having to do that.
> 
> 3. Server-side code
> 
> The fact that the boundaries of the data chunks are statically defined means 
> that you can just pre-generate them and serve them with a plain static HTTP 
> server.  Sometimes I meet people who disagree with this choice, but if you 
> have a workload that's dominated by reads then I think it's a pretty 
> clear win, especially if you have a lot of users.  Plus, the HTTP server will 
> generate appropriate caching-related headers for free.  And it's easier for 
> less-technical users, or users with limited rights on a server, to set up a 
> JBrowse instance if they don't need to set up CGI/servlets/whatever.
> 
> BAM and BigBed have made a similar choice to be statically-serveable; you 
> could view the JBrowse approach as doing something like what they're doing, 
> in a way that's easier to digest for web browser clients.  In the JBrowse 
> case, range queries are done by the client using a lazily-loaded nested 
> containment list, described here:
> 
> http://biowiki.org/view/JBrowse/LazyFeatureLoading
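
For readers unfamiliar with the structure, below is a minimal in-memory sketch 
of a range query over a nested containment list. It follows the general idea 
(intervals nested inside the intervals that contain them, so each list can be 
binary-searched), not the JBrowse file layout, and it leaves out the lazy 
loading entirely; the function names are hypothetical.

```python
from bisect import bisect_right


def build_nclist(intervals):
    """Nest (start, end) intervals so each one holds those it fully contains."""
    # Sort by start, longest first on ties, so containers precede contents.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top = []
    stack = []  # chain of enclosing nodes that are still "open"
    for start, end in ordered:
        node = {"start": start, "end": end, "sub": []}
        # Drop enclosing candidates that end before this interval does.
        while stack and stack[-1]["end"] < end:
            stack.pop()
        (stack[-1]["sub"] if stack else top).append(node)
        stack.append(node)
    return top


def query(nclist, qstart, qend):
    """Return the intervals overlapping the half-open range [qstart, qend)."""
    # Within one list no interval contains another, so ends are ascending.
    # Rebuilding this list per call keeps the sketch short; a real
    # implementation would store it or search the nodes directly.
    ends = [node["end"] for node in nclist]
    hits = []
    for node in nclist[bisect_right(ends, qstart):]:
        if node["start"] >= qend:
            break
        hits.append((node["start"], node["end"]))
        hits.extend(query(node["sub"], qstart, qend))
    return hits
```

In a lazily-loaded variant, the "sub" entry of a node would be a reference to 
a separate file that is only fetched when the query actually descends into it.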
> 
> It's possible to store the JSON gzipped on the server and then send it out 
> as-is; I compress the json files and give them a .jsonz extension, and then 
> add this to my apache config:
> 
> <Files *.jsonz>
>  ForceType text/javascript
>  Header set Content-Encoding: gzip
> </Files>
> 
> which works in all the web browsers that JBrowse supports, including IE 6/7/8.
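
The pre-compression step could be scripted along these lines. This is a 
sketch only: the function name is hypothetical, and it simply writes a gzip 
copy with the .jsonz extension next to each source file.

```python
import gzip
import shutil
from pathlib import Path


def precompress(json_path):
    """Write a gzip-compressed copy of json_path with a .jsonz suffix."""
    source = Path(json_path)
    target = source.with_suffix(".jsonz")
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target
```

Because the server then sends the stored bytes unchanged, the compression cost 
is paid once at generation time rather than on every request.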
> 
> I hope this contributes to the discussion,
> Mitch
> _______________________________________________
> DAS mailing list
> [email protected]
> http://lists.open-bio.org/mailman/listinfo/das


