On Fri, Mar 5, 2010 at 6:25 PM, Houghton, Andrew <hough...@oclc.org> wrote:
> OK, I will bite, you stated:
>
> 1. That large datasets are a problem.
> 2. That streaming APIs are a pain to deal with.
> 3. That tool sets have memory constraints.
>
> So how do you propose to process large JSON datasets that:
>
> 1. Comply with the JSON specification.
> 2. Can be read by any JavaScript/JSON processor.
> 3. Do not require the use of a streaming API.
> 4. Do not exceed the memory limitations of current JSON processors.

What I'm proposing is that we don't process large JSON datasets; I'm proposing that we process smallish JSON documents one at a time by pulling them out of a stream based on an end-of-record character. This is basically what we do for the MARC21 binary format: define a structure for a valid record, and separate multiple well-formed record structures with an end-of-record character. This preserves JSON specification adherence at the record level and uses a different scheme to represent collections. MARC-XML, of course, uses yet another mechanism to define a collection of records -- putting well-formed record structures inside a <collection> tag.

So... I'm proposing that we define what we mean by a single MARC record serialized to JSON (in whatever format; I'm not very opinionated on this point) -- one that preserves the order, indicators, tags, data, etc. that we need to round-trip between marc21binary, marc-xml, and marc-json -- and then separate those valid records with an end-of-record character: "\n".

Unless I've read all this wrong, you've come to the conclusion that the benefit of having a JSON serialization that is valid JSON at both the record and the collection level outweighs the pain of having to deal with a streaming parser and writer. This allows a whole collection to be treated as any other JSON document, which has obvious benefits (which I certainly don't mean to minimize) and all the drawbacks we've been talking about *ad nauseam*.

I go the other way. I think the pain of dealing with a streaming API outweighs the benefit of having a single valid JSON structure for a collection, and I've instead put forward that we use a combination of JSON records and a well-defined end-of-record character ("\n") to represent a collection. I recognize that this involves special-purpose code that must run JSON deserialization on each line, instead of being able to throw the whole stream/file/whatever at your JSON parser. I accept that trade-off because getting each line of a text file is something I find easy compared to dealing with streaming parsers. (There's a minimal sketch of what this looks like at the end of this message.)

And our point of disagreement, I think, is this: I believe that defining the collection structure in such a way that we need two steps (get a line; deserialize that line), and can't just call the equivalent of JSON.parse(stream), has benefits in ease of implementation and use that outweigh the loss of having both a single record and a collection of records be valid JSON. And you, I think, don't :-)

I'm going to bow out of this now, unless I've got some part of our positions wrong, to let any others who care (which may number zero) chime in.

-Bill-

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
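P.S. To make the two-step approach concrete, here's a minimal sketch in TypeScript on Node.js of both sides: writing a collection as one JSON record per line, and reading it back by getting a line and deserializing that line. The file name "records.ndjson" and the record fields are made up for illustration -- this isn't a definitive implementation of any particular MARC-JSON record format, just the line-delimited collection mechanics.

import { createReadStream, createWriteStream } from "fs";
import { createInterface } from "readline";

function writeCollection(path: string, records: object[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = createWriteStream(path, { encoding: "utf8" });
    out.on("error", reject);
    // JSON.stringify (with no indent argument) never emits a literal
    // newline -- newlines inside strings become the two-character
    // escape \n -- so every record lands on exactly one line.
    for (const record of records) {
      out.write(JSON.stringify(record) + "\n"); // end-of-record character
    }
    out.end(() => resolve());
  });
}

async function readCollection(path: string): Promise<void> {
  // Get a line; deserialize that line. No streaming JSON parser, and
  // memory use is bounded by the largest single record, not the
  // size of the collection.
  const lines = createInterface({
    input: createReadStream(path, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of lines) {
    if (line.trim() === "") continue; // tolerate blank lines
    const record = JSON.parse(line);  // each line is a complete, valid JSON record
    console.log(record.leader);      // "leader" is illustrative only
  }
}

async function main(): Promise<void> {
  // A made-up two-record "collection":
  await writeCollection("records.ndjson", [
    { leader: "00000nam a2200000 a 4500", fields: [] },
    { leader: "00000cam a2200000 a 4500", fields: [] },
  ]);
  await readCollection("records.ndjson");
}

main().catch(console.error);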