On 1 Oct 2010, at 20:57, Mitch Skinner wrote:
> In JBrowse, the "schema" can vary by track; my assumption was that the set of
> populated attributes in an individual track would be pretty uniform. Some
> tracks might not use the "phase" field, for example, but if a given track
> used phase information, then I figured that almost all of the features in
> that track would populate that field.
Indeed, and so long as your server/data file can determine in advance that
phase is not used at all (as you say), it can omit it entirely. I'd say most of
the time that's going to be perfectly possible. And to be honest I'm not wholly
convinced that DAS has significantly less uniformity in the fields used within
a data set, but it's worth us bearing in mind.
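For instance (a hypothetical sketch, not actual JBrowse output): a track that never populates phase could ship an indexed payload with that column dropped from the header and from every row, so the field costs nothing at all:

```
[
 ["start", "end", "name"],
 [10000, 15000, "featA"],
 [15100, 17500, "featB"]
]
```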
> Also, javascript allows for array entries to be omitted entirely, like:
>
> [10000, 15000, , "foo"]
>
> JSON theoretically doesn't allow this; omitted entries become "undefined" in
> javascript, and the "official" JSON spec disallows "undefined"
Interesting; I must admit my use of JSON is fairly limited, so I wasn't aware of
this possibility. It doesn't sound "nice", but when you're on the edge trying
to squeeze out all the speed you can, it doesn't seem a real issue.
>
> Also, depending on the use case, I wonder if the difference in
> (de)compression time between indexed and keyed JSON would matter. If you
> have your generated data handy still, I'd be curious to know what the
> difference is.
I don't have the original files, but I do have the script to generate some more
similar ones (attached). You could play around with whitespace too if you want
to do some optimisation.
use strict;
use warnings;

# Generate four ~100,000-record JSON files to compare the size of keyed
# (array-of-objects) vs indexed (array-of-arrays) encodings, plus "partial"
# variants in which roughly 10% of the values are missing. (Entries are
# comma-separated as they are written, so the output is valid JSON with no
# trailing commas.)
my @keys = ('id', 'start', 'end', 'segment', 'ori');

open(my $indexed,  '>', 'filesize-indexed.json')         or die $!;
open(my $keyed,    '>', 'filesize-keyed.json')           or die $!;
open(my $indexed2, '>', 'filesize-indexed-partial.json') or die $!;
open(my $keyed2,   '>', 'filesize-keyed-partial.json')   or die $!;

# The indexed files carry the key names just once, in a header row.
my $header = " [\n  " . join(",\n  ", map { "\"$_\"" } @keys) . "\n ]";
print {$keyed}    "[\n";
print {$keyed2}   "[\n";
print {$indexed}  "[\n$header";
print {$indexed2} "[\n$header";

for my $i (0 .. 99_999) {
    my (@kv, @iv, @kv2, @iv2);          # keyed/indexed rows, full and partial
    for my $key (@keys) {
        my $val = int(rand(1000));
        if (int(rand(10)) < 1) {        # ~10% chance: value missing
            push @iv2, '""';            # indexed form must keep the slot...
        } else {                        # ...keyed form just omits the pair
            push @iv2, "\"$val\"";
            push @kv2, "\"$key\" : \"$val\"";
        }
        push @kv, "\"$key\" : \"$val\"";
        push @iv, "\"$val\"";
    }
    print {$keyed}    ($i ? ",\n" : "") . " {\n  " . join(",\n  ", @kv)  . "\n }";
    print {$keyed2}   ($i ? ",\n" : "") . " {\n  " . join(",\n  ", @kv2) . "\n }";
    print {$indexed}  ",\n [\n  "       . join(",\n  ", @iv)  . "\n ]";
    print {$indexed2} ",\n [\n  "       . join(",\n  ", @iv2) . "\n ]";
}
print {$_} "\n]\n" for $keyed, $indexed, $keyed2, $indexed2;
close($_)          for $keyed, $indexed, $keyed2, $indexed2;
I'm sure there is a lot of existing study behind this very question. My
expectation is that the overhead of compression/decompression will be worth it
for anything larger than about 100 kB (probably even less), given the huge
difference it makes to the size of these files, and given that bandwidth is
usually the rate-limiting step. Up to now my working assumption has been that
if you're worried about speed and the size of your files, you probably need
compression.
Cheers,
Andy
_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das