On 1 Oct 2010, at 20:57, Mitch Skinner wrote:

> In JBrowse, the "schema" can vary by track; my assumption was that the set of 
> populated attributes in an individual track would be pretty uniform.  Some 
> tracks might not use the "phase" field, for example, but if a given track 
> used phase information, then I figured that almost all of the features in 
> that track would populate that field.

Indeed, and so long as your server/data file can determine in advance that 
phase is not used at all (as you say), it can omit it entirely. I'd say most of 
the time that's going to be perfectly possible. And to be honest I'm not wholly 
convinced that DAS data sets are significantly less uniform in the fields they 
use, but it's worth bearing in mind.
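
For example, with the indexed layout generated by the script later in this 
mail, a track that never uses "phase" simply leaves it out of the header row, 
so no per-row slot is wasted. A made-up two-row illustration (the values are 
invented, not from my original files):

 [
  ["id","start","end","segment","ori"],
  ["42","100","1500","chr1","+"],
  ["43","1600","2200","chr1","-"]
 ]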

> Also, javascript allows for array entries to be omitted entirely, like:
> 
> [10000, 15000, , "foo"]
> 
> JSON theoretically doesn't allow this; omitted entries become "undefined" in 
> javascript, and the "official" JSON spec disallows "undefined".

Interesting; I must admit my use of JSON is fairly limited, so I wasn't aware 
of this possibility. It doesn't sound "nice", but when you're on the edge 
trying to squeeze out all the speed you can, it doesn't seem an issue.
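
For what it's worth, a strict parser really does reject the elision. A quick 
check with the JSON module from CPAN (my own sketch, nothing to do with 
JBrowse's code):

 use JSON;
 decode_json('[10000, 15000, , "foo"]');     # croaks: malformed JSON string
 decode_json('[10000, 15000, null, "foo"]'); # fine; null is the spec-legal placeholder

So eval()-based clients will swallow the sparse form, but anything using a 
real JSON parser needs the null.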

> 
> Also, depending on the use case, I wonder if the difference in 
> (de)compression time between indexed and keyed JSON would matter.  If you 
> have your generated data handy still, I'd be curious to know what the 
> difference is.

I don't have the original files, but I do have the script to generate some 
more similar ones (attached, and inlined below). You could play around with 
the whitespace too if you want to do some optimisation.

use strict;
use warnings;

my @keys = ('id','start','end','segment','ori');

open(INDEXED,  '>', 'filesize-indexed.json')         or die $!;
open(KEYED,    '>', 'filesize-keyed.json')           or die $!;
open(INDEXED2, '>', 'filesize-indexed-partial.json') or die $!;
open(KEYED2,   '>', 'filesize-keyed-partial.json')   or die $!;

# The indexed files carry the key names just once, as a header row.
my $header = join(",\n", map { "  \"$_\"" } @keys);
print INDEXED  "[\n [\n$header\n ]";
print INDEXED2 "[\n [\n$header\n ]";
print KEYED  "[";
print KEYED2 "[";

for my $i (1 .. 100000) {
    my (@keyed, @indexed, @keyed2, @indexed2);
    for my $key (@keys) {
        my $val = int(rand(1000));
        if (int(rand(10)) < 1) {
            # ~10% of values are unpopulated: the indexed-partial file keeps
            # an empty placeholder slot, the keyed-partial file drops the pair.
            push @indexed2, "  \"\"";
        } else {
            push @indexed2, "  \"$val\"";
            push @keyed2,   "  \"$key\" : \"$val\"";
        }
        push @keyed,   "  \"$key\" : \"$val\"";
        push @indexed, "  \"$val\"";
    }
    # Commas go before each record rather than after it, so the output is
    # valid JSON (no trailing comma before the closing bracket).
    my $sep = $i == 1 ? "\n" : ",\n";
    print KEYED    $sep . " {\n" . join(",\n", @keyed)    . "\n }";
    print KEYED2   $sep . " {\n" . join(",\n", @keyed2)   . "\n }";
    print INDEXED  ",\n [\n" . join(",\n", @indexed)  . "\n ]";
    print INDEXED2 ",\n [\n" . join(",\n", @indexed2) . "\n ]";
}

print KEYED    "\n]\n";
print INDEXED  "\n]\n";
print KEYED2   "\n]\n";
print INDEXED2 "\n]\n";

close KEYED;
close INDEXED;
close KEYED2;
close INDEXED2;
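
If you want actual numbers on the (de)compression question, something along 
these lines would report sizes and gzip times for the four generated files (a 
rough sketch using IO::Compress::Gzip and Benchmark, not part of the script 
above):

use strict;
use warnings;
use Benchmark qw(timeit timestr);
use IO::Compress::Gzip qw(gzip $GzipError);

# Compress each generated file at maximum level and report size and time.
for my $file (glob 'filesize-*.json') {
    my $t = timeit(1, sub {
        gzip($file => "$file.gz", Level => 9) or die $GzipError;
    });
    printf "%-32s %9d -> %8d bytes  %s\n",
           $file, -s $file, -s "$file.gz", timestr($t);
}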

I'm sure there is a lot of existing study behind this very question. My 
expectation would be that the overhead of compression/decompression is going 
to be worth it for anything larger than 100 kb or so (probably even less), 
given the huge difference it makes to the size of these files and the fact 
that bandwidth is usually the rate-limiting step. Up to now my working 
assumption has been that if you're worried about speed and the size of your 
files, you probably need compression.

Cheers,
Andy
_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
