On Fri, Dec 21, 2018 at 2:44 AM Paul Moore <p.f.mo...@gmail.com> wrote:
>
> I'm looking for a way to incrementally decode a JSON file. I know this
> has come up before, and in general the problem is not soluble (because
> in theory the JSON file could be a single object). In my particular
> situation, though, I have a 9GB file containing a top-level array
> object, with many elements. So what I could (in theory) do is to parse
> an element at a time, yielding them.
>
> The problem is that the stdlib JSON library reads the whole file,
> which defeats my purpose. What I'd like is if it would read one
> complete element, then just enough far ahead to find out that the
> parse was done, and return the object it found (it should probably
> also return the "next token", as it can't reliably push it back - I'd
> check that it was a comma before proceeding with the next list
> element).

It IS possible to do an incremental parse, but for that to work, you
need to manually strip off the top-level array structure. The tool
for the job is this:

https://docs.python.org/3/library/json.html#json.JSONDecoder.raw_decode

It'll parse one value from the start of a string and then tell you
where it stopped, so you know what's left. Since your data isn't
coming from a ginormous string, but is coming from a file, you're
probably going to need something like this:

    import json

    def get_stuff_from_file(f):
        dec = json.JSONDecoder()
        # Assumes the file starts with the "[" of the top-level array.
        buffer = f.read(1024).lstrip()[1:]
        while True:
            # raw_decode() rejects leading whitespace, so strip the
            # separator and any padding before each attempt.
            buffer = buffer.lstrip().lstrip(",").lstrip()
            if buffer.startswith("]"):
                return  # end of the top-level array
            try:
                obj, pos = dec.raw_decode(buffer)
            except json.JSONDecodeError:
                chunk = f.read(1024)
                if not chunk:
                    raise  # EOF in the middle of an element
                buffer += chunk
            else:
                yield obj
                buffer = buffer[pos:]

Proper error handling is still left as an exercise for the reader,
both in terms of JSON errors and file errors. Also, the code is
completely untested. Have fun :)

The basic idea is that you keep on grabbing more data till you can
decode an object, then you keep whatever didn't get used up ("pos" is
the index in the buffer where the unconsumed text starts). Algorithmic
complexity should be O(n) as long as your objects are relatively
small, and you can optimize disk access by tuning the read size (1024
above) to be at least the average size of an object.

Hope that helps.

ChrisA
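
For reference, here is a minimal sketch of what raw_decode() does with a plain in-memory string, since the whole approach leans on that behaviour. The sample text and the is_decodable() helper are made up for illustration and are not part of the json module.

```python
import json

dec = json.JSONDecoder()

# Made-up sample: the tail of a top-level array, after the "[" was removed.
text = '{"a": 1}, {"b": [2, 3]}]'

# raw_decode() returns the decoded value plus the index where it stopped.
first, pos = dec.raw_decode(text)

# Skip the separator by hand; raw_decode() does not skip whitespace itself.
rest = text[pos:].lstrip().lstrip(",").lstrip()
second, pos2 = dec.raw_decode(rest)
leftover = rest[pos2:]  # only the closing "]" remains


def is_decodable(s):
    """Hypothetical helper: True if s starts with one complete JSON value."""
    try:
        dec.raw_decode(s)
    except json.JSONDecodeError:
        return False
    return True
```

One caveat this makes visible: is_decodable("12") is true even if the real element was 123 and the read happened to stop mid-number. Objects, arrays and strings have closing delimiters, so a truncated one always fails to decode, but an array of bare numbers would need an extra check that a comma or "]" follows before trusting the result.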