On Oct 12, 2012, at 9:40 AM, Chopin <robert....@gmail.com> wrote:
> 
> I got this 109 MB json file that I read... and it takes over 32
> seconds for parseJSON() to finish it. So I was wondering if it
> was a way to save it as binary or something like that so I can
> read it super fast?

The performance problem is that std.json works like a DOM parser for XML--it 
allocates a node per value in the JSON stream.  What we really need is 
something that works more like a SAX parser, with the DOM version as an optional 
layer built on top.  Just for kicks, I grabbed the fourth (largest) JSON blob 
from here:

http://www.json.org/example.html

then wrapped it in array brackets and duplicated the object until I had a ~350 MB 
input file, i.e.

[ paste, paste, paste, … ]
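
(Purely for reference, here is a rough sketch of how such an input file could be 
generated in D.  It assumes the sample object has been saved locally as 
"blob.json"--that name and the exact size cutoff are my own choices--and simply 
pastes the object into a top-level array until the output is roughly 350 MB.)


import std.file : readText;
import std.stdio;

void main()
{
    // Load the sample JSON object once; "blob.json" is an assumed file name.
    auto blob = readText("blob.json");
    auto output = File("input.txt", "w");

    // Wrap repeated copies of the object in a top-level array:
    // [ paste, paste, paste, ... ]
    output.write("[");
    size_t written = 1;
    bool first = true;
    while (written < 350_000_000)
    {
        if (!first)
        {
            output.write(",");
            ++written;
        }
        output.write(blob);
        written += blob.length;
        first = false;
    }
    output.write("]");
}
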

Then I parsed it via this test app, based on an example in a SAX-style JSON 
parser I wrote in C:


import core.stdc.stdlib;
import core.sys.posix.unistd;
import core.sys.posix.sys.stat;
import core.sys.posix.fcntl;
import std.json;

void main()
{
    // Null-terminated, mutable copy of the file name for the C calls below.
    auto filename = "input.txt\0".dup;

    // Get the file size so the whole file can be read in one shot.
    stat_t st;
    stat(filename.ptr, &st);
    auto sz = st.st_size;

    // Slurp the entire file into a malloc'd buffer.
    auto buf = cast(char*) malloc(sz);
    auto fh = open(filename.ptr, O_RDONLY);
    read(fh, buf, sz);
    close(fh);

    // Parse the whole buffer into a JSONValue tree (DOM style).
    auto json = parseJSON(buf[0 .. sz]);
}


Here are my results:


$ dmd -release -inline -O dtest
$ ll input.txt
-rw-r--r--  1 sean  staff  365105313 Oct 12 15:50 input.txt
$ time dtest

real  1m36.462s
user 1m32.468s
sys   0m1.102s
 

Then I ran my SAX-style parser example on the same input file:


$ make example
cc example.c -o example lib/release/myparser.a
$ time example

real  0m2.191s
user 0m1.944s
sys   0m0.241s


So clearly the problem isn't parsing JSON in general but rather generating an 
object tree for a large input stream.  Note that the D app used gigabytes of 
memory to process this file--I believe the total VM footprint was around 3.5 
GB--while the C app used a fixed amount roughly equal to the size of the input 
file.  In short, DOM-style parsers are great for small data and terrible for 
large data.
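
(To illustrate the fixed-memory point, here is a minimal sketch of a streaming 
pass over the same wrapped input.  It is not a real JSON parser--it only tracks 
string state and brace depth to count the top-level objects, and assumes 
well-formed input--but it shows why an event-based pass needs no per-value 
allocation and runs in constant memory regardless of input size.)


import std.stdio;

void main()
{
    auto f = File("input.txt", "rb");
    ubyte[64 * 1024] chunk;        // fixed-size read buffer; memory use stays O(1)

    size_t depth;                  // current {} nesting depth
    size_t topLevelObjects;        // objects sitting directly in the wrapping array
    bool inString, escaped;

    foreach (ubyte[] block; f.byChunk(chunk[]))
    {
        foreach (ubyte c; block)
        {
            if (inString)
            {
                // Skip string contents, honoring backslash escapes.
                if (escaped)        escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"')  inString = false;
            }
            else if (c == '"')
                inString = true;
            else if (c == '{')
            {
                if (++depth == 1)
                    ++topLevelObjects;
            }
            else if (c == '}')
                --depth;
        }
    }
    writefln("top-level objects: %s", topLevelObjects);
}
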

