> On Jan 7, 2015, at 2:45 PM, Andrew Dunstan <and...@dunslane.net> wrote:
>
> On 01/07/2015 08:25 AM, Aaron Botsis wrote:
>> Hi folks, I was having a problem importing json data with COPY. Lots of
>> things export data nicely as one json blob per line. This is excellent for
>> directly importing into a JSON/JSONB column for analysis.
>>
>> ...Except when there's an embedded doublequote. Or anything that's escaped.
>> COPY handles this, but by the time the escaped char hits the JSON parser,
>> it's not escaped anymore. This breaks the JSON parsing. This means I need to
>> manipulate the input data to double-escape it. See bug #12320 for an
>> example. Yuck.
>>
>> I propose this small patch that simply allows specifying COPY … ESCAPE
>> without requiring the CSV parser. It will make it much easier to directly
>> use json formatted export data for folks going forward. This seemed like the
>> simplest route.
>
> This isn't a bug. Neither CSV format nor TEXT format is particularly
> suitable for json. I'm quite certain I could compose legal json that will
> break your proposal (for example, with an embedded newline in the white
> space.)
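For anyone following along, the "double-escape" dance I'm complaining about
looks roughly like this. A minimal Python sketch (the helper name is mine):
text-format COPY strips one level of backslash escaping on input, so a JSON
line has to be pre-escaped before feeding it in, or the parser sees
already-unescaped bytes:

```python
import json

def escape_for_copy_text(line: str) -> str:
    """Escape one data line for PostgreSQL text-format COPY input.

    Text-format COPY treats backslash as an escape introducer and strips
    one level of escaping, so every backslash already in the JSON (e.g.
    in \" or \n escapes) must be doubled, and literal newlines, tabs and
    carriage returns must be encoded as escape sequences.
    """
    return (line.replace('\\', '\\\\')
                .replace('\n', '\\n')
                .replace('\r', '\\r')
                .replace('\t', '\\t'))

# A JSON blob with an embedded double quote:
record = json.dumps({"msg": 'she said "hi"'})
print(record)                       # {"msg": "she said \"hi\""}
print(escape_for_copy_text(record))
```

After COPY removes one escaping level from the second line, the JSON parser
gets the first line back intact.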
Sorry - though I originally reported it as a bug, I didn't mean to imply it
ultimately was one. :) The patch is a feature enhancement.

> It's also unnecessary. CSV format, while not designed for this, is
> nevertheless sufficiently flexible to allow successful import of json data
> meeting certain criteria (essentially no newlines), like this:
>
> copy the_table(jsonfield)
> from '/path/to/jsondata'
> csv quote e'\x01' delimiter e'\x02';

While perhaps unnecessary, given the size and simplicity of the patch, IMO
it's a no-brainer to merge (it actually makes the code smaller by three
lines). It also enables non-json use cases any time one might want to
preserve embedded escapes, or use different ones entirely. Do you see other
reasons not to commit it?

> You aren't the first person to encounter this problem. See
> <http://adpgtech.blogspot.com/2014/09/importing-json-data.html>
>
> Maybe we need to add something like this to the docs, or to the wiki.

In your post you acknowledged that a text mode copy with a null escape
character would have solved your problem, and to me, that was the
intuitive/obvious choice as well. *shrug*

> Note too my comment in that blog post:
>
>     Now this solution is a bit of a hack. I wonder if there's a case for
>     a COPY mode that simply treats each line as a single datum. I also
>     wonder if we need some more specialized tools for importing JSON,
>     possibly one or more Foreign Data Wrappers. Such things could
>     handle, say, embedded newline punctuation.

Agreed. Though https://github.com/citusdata/json_fdw already exists (I
haven't used it).

How do folks feel about a COPY mode that reads a single datum until it finds
a single-character record terminator alone on a line? Something like:

=# COPY test(data) from stdin terminator '}';
End with a backslash and a period on a line by itself.
>>{
>> "a": 123, "c": "string", "b": [1, 2, 3
>> ]
>>}
>>{
>> [1,2,3]
>>}
>>\.
COPY 2

You could also get fancy and support multi-character record terminators,
which would allow the same thing in a slightly different way:

COPY test(data) from stdin terminator '\n}';

I don't know how crazy you could get without a lot of rework. It might have
to be used in conjunction with a more constrained mode like "COPY … RAW".

Aaron
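Back on your csv quote e'\x01' trick: here's a quick Python sanity check of
why it works (Python's csv module standing in for COPY's CSV parser, which
follows the same basic quoting rules). With a quote character and delimiter
that can never appear in JSON text, CSV parsing degenerates to "take the
whole line as one untouched field":

```python
import csv, io, json

# Same idea as: copy ... csv quote e'\x01' delimiter e'\x02'
# \x01 and \x02 cannot occur in JSON text, so the CSV layer never
# splits, quotes, or unescapes anything.
line = '{"a": 123, "s": "with \\"quotes\\" and a tab\\tescape"}'
reader = csv.reader(io.StringIO(line), quotechar='\x01', delimiter='\x02')
field, = next(reader)             # exactly one field per line
assert field == line              # passed through byte-for-byte
print(json.loads(field))
```

The JSON escapes survive untouched, which is exactly what the text format
fails to guarantee without my patch.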
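As a thought experiment, the "terminator alone on a line" mode above could
be sketched in a few lines of Python. The function name and semantics here
are my guess at the proposed behavior, not anything COPY implements today:

```python
import io, json

def read_records(stream, terminator='}'):
    """Accumulate lines until `terminator` appears alone on a line, then
    yield everything buffered (terminator line included) as one datum --
    my reading of the proposed COPY ... terminator semantics."""
    buf = []
    for line in stream:
        buf.append(line)
        if line.rstrip('\n') == terminator:
            yield ''.join(buf)
            buf = []

# The two records from the example session above:
data = io.StringIO(
    '{\n "a": 123, "c": "string", "b": [1, 2, 3\n ]\n}\n'
    '{\n [1,2,3]\n}\n'
)
records = list(read_records(data))
print(len(records))         # 2, matching the "COPY 2" above
print(json.loads(records[0]))
```

Note that splitting happens before any JSON validation, so (as with the real
proposal) each datum would still be checked by the destination column's type.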