> On Jan 7, 2015, at 2:45 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> 
> 
> On 01/07/2015 08:25 AM, Aaron Botsis wrote:
>> Hi folks, I was having a problem importing json data with COPY. Lots of 
>> things export data nicely as one json blob per line. This is excellent for 
>> directly importing into a JSON/JSONB column for analysis.
>> 
>> ...Except when there’s an embedded doublequote. Or anything that’s escaped. 
>> COPY handles this, but by the time the escaped char hit the JSON parser, 
>> it's not escaped anymore. This breaks the JSON parsing. This means I need to 
>> manipulate the input data to double-escape it. See bug #12320 for an 
>> example. Yuck.
>> 
>> I propose this small patch that simply allows specifying COPY … ESCAPE 
>> without requiring the CSV parser. It will make it much easier to directly 
>> use json formatted export data for folks going forward. This seemed like the 
>> simplest route.
> 
> This isn't a bug. Neither CSV format nor TEXT format is particularly 
> suitable for json. I'm quite certain I could compose legal json that will 
> break your proposal (for example, with an embedded newline in the white 
> space.)

Sorry - though I originally reported it as a bug, I didn’t mean to imply it 
ultimately was one. :) The patch is a feature enhancement.
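
For context, the double-escape manipulation I mentioned amounts to something 
like this (a rough Python sketch of the workaround, not part of the patch; 
`escape_for_copy_text` is just an illustrative name):

```python
def escape_for_copy_text(line: str) -> str:
    # COPY's text format consumes backslash escapes on input, so any
    # backslash in the JSON must be doubled to survive the round trip.
    # Tabs would also need escaping; embedded newlines can't occur in
    # one-JSON-blob-per-line data, so they're not handled here.
    return line.replace("\\", "\\\\").replace("\t", "\\t")

# A JSON line with an embedded (escaped) double quote:
print(escape_for_copy_text('{"msg": "say \\"hi\\""}'))
# → {"msg": "say \\"hi\\""}
```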

> It's also unnecessary. CSV format, while not designed for this, is 
> nevertheless sufficiently flexible to allow successful import of json data 
> meeting certain criteria (essentially no newlines), like this:
> 
>   copy the_table(jsonfield)
>   from '/path/to/jsondata'
>   csv quote e'\x01' delimiter e'\x02';

While perhaps unnecessary, given the size and simplicity of the patch, IMO it's 
a no-brainer to merge (it actually makes the code smaller by three lines). It 
also enables non-json use cases anytime one might want to preserve embedded 
escapes, or use different ones entirely. Do you see other reasons not to commit 
it?

> You aren't the first person to encounter this problem. See 
> <http://adpgtech.blogspot.com/2014/09/importing-json-data.html>
> 
> Maybe we need to add something like this to the docs, or to the wiki.

In your post you acknowledged that a text-mode copy with a null escape character 
would have solved your problem, and to me, that was the intuitive/obvious choice 
as well. *shrug*

> Note too my comment in that blog post:
> 
>   Now this solution is a bit of a hack. I wonder if there's a case for
>   a COPY mode that simply treats each line as a single datum. I also
>   wonder if we need some more specialized tools for importing JSON,
>   possibly one or more Foreign Data Wrappers. Such things could
>   handle, say, embedded newline punctuation.

Agree. Though https://github.com/citusdata/json_fdw already exists (I haven't 
used it). How do folks feel about a COPY mode that reads a single datum until 
it finds a single-character record terminator alone on a line? Something like:

=# COPY test(data) from stdin terminator '}';
End with a backslash and a period on a line by itself.
>>{
>>  "a": 123, "c": "string",  "b": [1, 2, 3
>> ]
>>}
>>{
>>      [1,2,3]
>>}
>>\.
COPY 2

You could also get fancy and support multi-character record terminators, which 
would allow the same thing in a slightly different way:

COPY test(data) from stdin terminator '\n}';
I don't know how far you could take this without a lot of rework. It might have 
to be used in conjunction with a more constrained mode like "COPY ... RAW".
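
To make the semantics concrete, the reading loop I have in mind is basically 
this (a toy Python sketch, assuming a terminator alone on a line closes a 
record, and with the second sample adjusted so both records parse as JSON):

```python
import io
import json

def read_records(stream, terminator="}"):
    # Accumulate lines; a line consisting solely of the terminator
    # closes the current record (and is part of it).
    buf = []
    for line in stream:
        buf.append(line)
        if line.rstrip("\n") == terminator:
            yield "".join(buf)
            buf = []

data = io.StringIO(
    '{\n  "a": 123, "c": "string",  "b": [1, 2, 3\n]\n}\n'
    '{\n  "x": [1, 2, 3]\n}\n'
)
records = list(read_records(data))
print(len(records))                  # → 2
print(json.loads(records[0])["b"])   # → [1, 2, 3]
```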

Aaron
