Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul25, 2011, at 02:03 , Florian Pflug wrote:
> On Jul25, 2011, at 00:48 , Joey Adams wrote:
>> Should we follow the JavaScript standard for rendering numbers (which
>> my suggestion approximates)? Or should we use the shortest encoding
>> as Florian suggests?
>
> In the light of the above, consider my suggestion withdrawn. I now think
> we should just follow the JavaScript standard as closely as possible.
> As you said, it's pretty much the same as your suggestion, just more
> precise in the handling of some corner-cases like infinity, nan, +/-0,
> some questions of leading and trailing zeros, ...

Just FYI, I browsed through the ECMA standard you referenced again, and realized that they explicitly forbid JSON numeric values to be NaN or (-)Infinity (page 205, step 9 at the top of the page). RFC 4627 seems to take the same stand. I fail to see the wisdom in that, but it's what the standard says...

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
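For reference, the behavior the standards mandate is easy to check in Python, whose json module accepts NaN/Infinity by default (a documented extension) but enforces the RFC 4627 / ECMA-262 rule with allow_nan=False. A minimal sketch:

```python
import json

# Python's json extends the spec: NaN and infinities serialize as bare words.
assert json.dumps(float('nan')) == 'NaN'
assert json.dumps(float('inf')) == 'Infinity'

# With allow_nan=False it enforces what ECMA-262 / RFC 4627 require:
def strict_dumps(value):
    try:
        return json.dumps(value, allow_nan=False)
    except ValueError:
        return None  # rejected, as the standards demand

assert strict_dumps(float('inf')) is None
assert strict_dumps(1.5) == '1.5'
```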
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul25, 2011, at 07:35 , Joey Adams wrote:
> On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams wrote:
>> Should we mimic IEEE floats and preserve -0 versus +0 while treating
>> them as equal? Or should we treat JSON floats like numeric and
>> convert -0 to 0 on input? Or should we do something else? I think
>> converting -0 to 0 would be a bad idea, as it would violate the
>> intuitive assumption that JSON can be used to marshal double-precision
>> floats.
>
> On the other hand, JavaScript's own .toString and JSON.stringify turn
> -0 into 0, so JSON can't marshal -0 around, anyway (in practice). Now
> I think turning -0 into 0 would be fine for canonicalizing numbers in
> json_in.

+1.

best regards,
Florian Pflug
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams wrote:
> Should we mimic IEEE floats and preserve -0 versus +0 while treating
> them as equal? Or should we treat JSON floats like numeric and
> convert -0 to 0 on input? Or should we do something else? I think
> converting -0 to 0 would be a bad idea, as it would violate the
> intuitive assumption that JSON can be used to marshal double-precision
> floats.

On the other hand, JavaScript's own .toString and JSON.stringify turn -0 into 0, so JSON can't marshal -0 around, anyway (in practice). Now I think turning -0 into 0 would be fine for canonicalizing numbers in json_in.

- Joey
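The -0 behavior under discussion can be reproduced quickly in Python (a hedged sketch; canonicalize_zero is a hypothetical illustration of what json_in would do, not an existing API):

```python
import json
import math

v = json.loads('-0.0')
assert v == 0.0                       # IEEE 754: -0 compares equal to +0
assert math.copysign(1.0, v) == -1.0  # but the sign bit survives parsing

# A json_in-style canonicalizer that folds -0 (and -0.0, -0e5, ...) to 0:
def canonicalize_zero(num_text):
    return '0' if float(num_text) == 0.0 else num_text

assert canonicalize_zero('-0') == '0'
assert canonicalize_zero('-0.0') == '0'
assert canonicalize_zero('-0.5') == '-0.5'
```

Note that JavaScript's JSON.stringify(-0) already produces "0", so folding on input loses nothing that survives a round-trip through a JS client anyway.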
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug wrote:
> The downside being that we'd then either need to canonicalize in
> the equality operator, or live with either no equality operator or
> a rather strange one.

It just occurred to me that, even if we sort object members, texteq might not be a sufficient way to determine equality. In particular, IEEE floats treat +0 and -0 as two different things, but they are equal when compared. Note that we're only dealing with a decimal representation; we're not (currently) converting to double-precision representation and back.

Should we mimic IEEE floats and preserve -0 versus +0 while treating them as equal? Or should we treat JSON floats like numeric and convert -0 to 0 on input? Or should we do something else? I think converting -0 to 0 would be a bad idea, as it would violate the intuitive assumption that JSON can be used to marshal double-precision floats.

- Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul25, 2011, at 00:48 , Joey Adams wrote:
> On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug wrote:
>> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams wrote:
>>>> ... Fortunately, JSON's definition of a "number" is its decimal
>>>> syntax, so the algorithm is child's play:
>>>>
>>>> * Figure out the digits and exponent.
>>>> * If the exponent is greater than 20 or less than -6 (arbitrary),
>>>>   use exponential notation.
>>
>> I agree. As for your proposed algorithm, I suggest to instead use
>> exponential notation if it produces a shorter textual representation.
>> In other words, for values between -1 and 1, we'd switch to exponential
>> notation if there's more than 1 leading zero (to the right of the decimal
>> point, of course), and for values outside that range if there're more than
>> 2 trailing zeros and no decimal point. All after redundant zeros and
>> decimal points are removed. So we'd store
>>
>> 0 as 0
>> 1 as 1
>> 0.1 as 0.1
>> 0.01 as 0.01
>> 0.001 as 1e-3
>> 10 as 10
>> 100 as 100
>> 1000 as 1e3
>> 1000.1 as 1000.1
>> 1001 as 1001
>
> Interesting idea. The reason I suggested using exponential notation
> only for extreme exponents (less than -6 or greater than +20) is
> partly for presentation value. Users might be annoyed to see 1000000
> turned into 1e6.

I'm not concerned about that, but ...

> Moreover, applications working solely with integers
> that don't expect the floating point syntax may choke on the converted
> numbers.

now that you say it, that's definitely a concern.

> 32-bit integers can be losslessly encoded as IEEE
> double-precision floats (JavaScript's internal representation), and
> JavaScript's algorithm for converting a number to a string ([1],
> section 9.8.1) happens to preserve the integer syntax (I think).

Indeed. In fact, it seems to be designed to use the integer syntax for all integral values with <= 66 binary digits. log10(2^66) ~ 19.87

> Should we follow the JavaScript standard for rendering numbers (which
> my suggestion approximates)? Or should we use the shortest encoding
> as Florian suggests?

In the light of the above, consider my suggestion withdrawn. I now think we should just follow the JavaScript standard as closely as possible. As you said, it's pretty much the same as your suggestion, just more precise in the handling of some corner-cases like infinity, nan, +/-0, some questions of leading and trailing zeros, ...

I wouldn't have made my suggestion had I realized earlier that the limit of 20 for the exponent was carefully chosen to ensure that the full range of a 64-bit integer value would be represented in non-exponential notation. I assumed the bracketed "arbitrary" in your description applied to both the upper (20) as well as the lower (-6) bound, when it really only applies to the lower bound. Sorry for that.

(I am now curious where the seemingly arbitrary lower bound of -6 comes from, though. The only explanation I can come up with is that somebody figured that 0.01 is still easily distinguished visually from 0.1, but not so much from 0.001.)

best regards,
Florian Pflug
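The ECMA-262 section 9.8.1 behavior the thread settles on can be approximated as follows. This is a hedched sketch in Python using Decimal as a stand-in for the decimal-text parsing a C implementation would do; n is the decimal-point position in the standard's terms, and positional notation is used while -6 < n <= 21:

```python
from decimal import Decimal

def js_style_number(text):
    """Render a decimal string roughly the way ECMA-262 section 9.8.1 does:
    positional notation while -6 < n <= 21, exponential otherwise."""
    d = Decimal(text).normalize()        # strip redundant zeros
    if d == 0:
        return '0'                       # fold -0 to 0, as JS does
    sign, digit_tuple, exp = d.as_tuple()
    digits = ''.join(map(str, digit_tuple))
    n = exp + len(digits)                # decimal-point position
    if -6 < n <= 21:
        body = format(abs(d), 'f')
    else:
        mantissa = digits[0] + ('.' + digits[1:] if len(digits) > 1 else '')
        body = f"{mantissa}e{'+' if n > 0 else ''}{n - 1}"
    return ('-' if sign else '') + body

assert js_style_number('1000') == '1000'          # integers stay integral
assert js_style_number('0.000001') == '0.000001'
assert js_style_number('0.0000001') == '1e-7'     # below the -6 bound
```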
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Sat, Jul 23, 2011 at 11:14 PM, Robert Haas wrote:
> I doubt you're going to want to reinvent TOAST, ...

I was thinking about making it efficient to access or update foo.a.b.c.d[1000] in a huge JSON tree. Simply TOASTing the varlena text means we have to unpack the entire datum to access and update individual members. An alternative would be to split the JSON into chunks (possibly by using the pg_toast_ table) and have some sort of index that can be used to efficiently look up values by path. This would not be trivial, and I don't plan to implement it any time soon.

On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug wrote:
> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams wrote:
>>> ... Fortunately, JSON's definition of a
>>> "number" is its decimal syntax, so the algorithm is child's play:
>>>
>>> * Figure out the digits and exponent.
>>> * If the exponent is greater than 20 or less than -6 (arbitrary), use
>>>   exponential notation.
>
> I agree. As for your proposed algorithm, I suggest to instead use
> exponential notation if it produces a shorter textual representation.
> In other words, for values between -1 and 1, we'd switch to exponential
> notation if there's more than 1 leading zero (to the right of the decimal
> point, of course), and for values outside that range if there're more than
> 2 trailing zeros and no decimal point. All after redundant zeros and
> decimal points are removed. So we'd store
>
> 0 as 0
> 1 as 1
> 0.1 as 0.1
> 0.01 as 0.01
> 0.001 as 1e-3
> 10 as 10
> 100 as 100
> 1000 as 1e3
> 1000.1 as 1000.1
> 1001 as 1001

Interesting idea. The reason I suggested using exponential notation only for extreme exponents (less than -6 or greater than +20) is partly for presentation value. Users might be annoyed to see 1000000 turned into 1e6. Moreover, applications working solely with integers that don't expect the floating point syntax may choke on the converted numbers.

32-bit integers can be losslessly encoded as IEEE double-precision floats (JavaScript's internal representation), and JavaScript's algorithm for converting a number to a string ([1], section 9.8.1) happens to preserve the integer syntax (I think).

Should we follow the JavaScript standard for rendering numbers (which my suggestion approximates)? Or should we use the shortest encoding as Florian suggests?

- Joey

[1]: http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul24, 2011, at 05:14 , Robert Haas wrote:
> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams wrote:
>> Interesting. This leads to a couple more questions:
>>
>> * Should the JSON data type (eventually) have an equality operator?
>
> +1.

+1.

>> * Should the JSON input function alphabetize object members by key?
>
> I think it would probably be better if it didn't. I'm wary of
> overcanonicalization. It can be useful to have things come back out
> in more or less the format you put them in.

The downside being that we'd then either need to canonicalize in the equality operator, or live with either no equality operator or a rather strange one.

Also, if we don't canonicalize now, we (or rather our users) are in for some pain should we ever decide to store JSON values in some other form than plain text. Because if we do that, we'd presumably want to order the members in some predefined way (by their hash value, for example).

So, from me "+1" for alphabetizing members.

>> If we canonicalize strings and numbers and alphabetize object members,
>> then our equality function is just texteq. The only stumbling block
>> is canonicalizing numbers. Fortunately, JSON's definition of a
>> "number" is its decimal syntax, so the algorithm is child's play:
>>
>> * Figure out the digits and exponent.
>> * If the exponent is greater than 20 or less than -6 (arbitrary), use
>>   exponential notation.
>>
>> The problem is: 2.718282e-1000 won't equal 0 as may be expected. I
>> doubt this matters much.
>
> I don't think that 2.718282e-100 SHOULD equal 0.

I agree.

As for your proposed algorithm, I suggest to instead use exponential notation if it produces a shorter textual representation. In other words, for values between -1 and 1, we'd switch to exponential notation if there's more than 1 leading zero (to the right of the decimal point, of course), and for values outside that range if there're more than 2 trailing zeros and no decimal point. All after redundant zeros and decimal points are removed. So we'd store

0 as 0
1 as 1
0.1 as 0.1
0.01 as 0.01
0.001 as 1e-3
10 as 10
100 as 100
1000 as 1e3
1000.1 as 1000.1
1001 as 1001

>> If, in the future, we add the ability to manipulate large JSON trees
>> efficiently (e.g. by using an auxiliary table like TOAST does), we'll
>> probably want unique members, so enforcing them now may be prudent.
>
> I doubt you're going to want to reinvent TOAST, but I do think there
> are many advantages to forbidding duplicate keys. ISTM the question
> is whether to throw an error or just silently discard one of the k/v
> pairs. Keeping both should not be on the table, IMHO.

+1.

best regards,
Florian Pflug
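The shortest-representation rule above can be prototyped directly. The sketch below (Python, with Decimal standing in for the numeric parsing a C implementation would do) strips redundant zeros and then picks exponential notation only when it is strictly shorter, reproducing the table in the message:

```python
from decimal import Decimal

def shortest_number(text):
    d = Decimal(text).normalize()          # remove redundant zeros/points
    sign, digit_tuple, exp = d.as_tuple()
    digits = ''.join(map(str, digit_tuple))
    plain = format(abs(d), 'f')            # positional notation
    adjusted = exp + len(digits) - 1       # exponent for the d.ddd form
    mantissa = digits[0] + ('.' + digits[1:] if len(digits) > 1 else '')
    sci = f"{mantissa}e{adjusted}"
    body = sci if len(sci) < len(plain) else plain  # prefer plain on ties
    return ('-' if sign else '') + body

# The table from the message above:
table = {'0': '0', '1': '1', '0.1': '0.1', '0.01': '0.01', '0.001': '1e-3',
         '10': '10', '100': '100', '1000': '1e3', '1000.1': '1000.1',
         '1001': '1001'}
for src, expected in table.items():
    assert shortest_number(src) == expected
```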
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams wrote:
> Interesting. This leads to a couple more questions:
>
> * Should the JSON data type (eventually) have an equality operator?

+1.

> * Should the JSON input function alphabetize object members by key?

I think it would probably be better if it didn't. I'm wary of overcanonicalization. It can be useful to have things come back out in more or less the format you put them in.

> If we canonicalize strings and numbers and alphabetize object members,
> then our equality function is just texteq. The only stumbling block
> is canonicalizing numbers. Fortunately, JSON's definition of a
> "number" is its decimal syntax, so the algorithm is child's play:
>
> * Figure out the digits and exponent.
> * If the exponent is greater than 20 or less than -6 (arbitrary), use
>   exponential notation.
>
> The problem is: 2.718282e-1000 won't equal 0 as may be expected. I
> doubt this matters much.

I don't think that 2.718282e-100 SHOULD equal 0.

> If, in the future, we add the ability to manipulate large JSON trees
> efficiently (e.g. by using an auxiliary table like TOAST does), we'll
> probably want unique members, so enforcing them now may be prudent.

I doubt you're going to want to reinvent TOAST, but I do think there are many advantages to forbidding duplicate keys. ISTM the question is whether to throw an error or just silently discard one of the k/v pairs. Keeping both should not be on the table, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Also, should I forbid the escape \u0000 (in all database encodings)?

Pros:

* If \u0000 is forbidden, and the server encoding is UTF-8, then every JSON-wrapped string will be convertible to TEXT.
* It will be consistent with the way PostgreSQL already handles text, and with the decision to use database-encoded JSON strings.
* Some applications choke on strings with null characters. For example, in some web browsers but not others, if you pass "Hello\u0000world" to document.write() or assign it to a DOM object's innerHTML, it will be truncated to "Hello". By banning \u0000, users can catch such rogue strings early.
* It's a little easier to represent internally.

Cons:

* Means the JSON type will accept a subset of the JSON described in RFC 4627. However, the RFC does say "An implementation may set limits on the length and character contents of strings", so we can arguably get away with banning \u0000 while being law-abiding citizens.
* Being able to store U+0000–U+00FF means users can use JSON strings to hold binary data: by treating it as Latin-1.

- Joey
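The kind of breakage \u0000 causes is easy to reproduce. A hedged Python sketch (Python's json happily accepts the escape, so the check has to be explicit; check_no_nul is a hypothetical helper, not part of any real API):

```python
import json

s = json.loads('"Hello\\u0000world"')
assert len(s) == 11 and s[5] == '\x00'  # the NUL survives parsing intact

# A validator enforcing the proposed restriction might look like this:
def check_no_nul(json_text):
    decoded = json.loads(json_text)
    if isinstance(decoded, str) and '\x00' in decoded:
        raise ValueError(r'\u0000 is not allowed in JSON strings')
    return decoded

try:
    check_no_nul('"Hello\\u0000world"')
except ValueError:
    pass  # rejected early, before it can silently truncate downstream
assert check_no_nul('"Hello world"') == 'Hello world'
```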
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 22, 2011 at 7:12 PM, Robert Haas wrote:
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I agree. Besides, throwing an error on duplicate keys seems like the most logical thing to do. The most compelling reason not to, I think, is that it would make the input function a little slower.

On Fri, Jul 22, 2011 at 8:26 PM, Florian Pflug wrote:
>> * The XML type doesn't have this restriction (it just stores the
>> input text verbatim, and converts it to UTF-8 before doing anything
>> complicated with it).
>
> Yeah. But the price the XML type pays for that is the lack of an
> equality operator.

Interesting. This leads to a couple more questions:

* Should the JSON data type (eventually) have an equality operator?
* Should the JSON input function alphabetize object members by key?

If we canonicalize strings and numbers and alphabetize object members, then our equality function is just texteq. The only stumbling block is canonicalizing numbers. Fortunately, JSON's definition of a "number" is its decimal syntax, so the algorithm is child's play:

* Figure out the digits and exponent.
* If the exponent is greater than 20 or less than -6 (arbitrary), use exponential notation.

The problem is: 2.718282e-1000 won't equal 0 as may be expected. I doubt this matters much.

It would be nice to canonicalize JSON on input, and that's the way I'd like to go, but two caveats are:

* Input (and other operations) would require more CPU time. Instead of being able to pass the data through a quick condense function, it'd have to construct an AST (to sort object members) and re-encode the JSON back into a string.
* Users, for aesthetic reasons, might not want their JSON members rearranged.

If, in the future, we add the ability to manipulate large JSON trees efficiently (e.g. by using an auxiliary table like TOAST does), we'll probably want unique members, so enforcing them now may be prudent.

- Joey
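The "equality is just texteq" idea is easy to demonstrate. A Python sketch of the canonicalization step (sorted members, minimal whitespace; number canonicalization is left out for brevity):

```python
import json

def canonicalize(json_text):
    # Build the AST (json.loads), then re-encode with alphabetized
    # members and no insignificant whitespace. Equality of JSON values
    # then reduces to plain string equality, i.e. texteq.
    return json.dumps(json.loads(json_text),
                      sort_keys=True, separators=(',', ':'))

a = canonicalize('{"b": 1, "a": [1, 2]}')
b = canonicalize('{"a":[1,2],"b":1}')
assert a == b == '{"a":[1,2],"b":1}'
```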
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul23, 2011, at 00:04 , Joey Adams wrote:
> I think I've decided to only allow escapes of non-ASCII characters
> when the database encoding is UTF8. For example, $$"\u2013"$$::json
> will fail if the database encoding is WIN1252, even though WIN1252 can
> encode U+2013 (EN DASH). This may be somewhat draconian, given that:
>
> * SQL_ASCII can otherwise handle "any" language according to the
> documentation.

+1. It makes the handling of \uXXXX sequences consistent with the behaviour of CHR(), which seems like a Good Thing. Clients can also work around this restriction by de-escaping themselves, which shouldn't be too hard.

> * The XML type doesn't have this restriction (it just stores the
> input text verbatim, and converts it to UTF-8 before doing anything
> complicated with it).

Yeah. But the price the XML type pays for that is the lack of an equality operator.

best regards,
Florian Pflug
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul23, 2011, at 01:12 , Robert Haas wrote:
> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams wrote:
>> On another matter, should the JSON type guard against duplicate member
>> keys? The JSON RFC says "The names within an object SHOULD be
>> unique," meaning JSON with duplicate members can be considered valid.
>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>> the same behavior: discard the first member in favor of the second.
>> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
>> error if a duplicate attribute is present (e.g.
>> '<a href="b" href="c"/>'::xml).
>
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I'm usually all for throwing an error on ambiguous input - but if Javascript, PHP and Python all agree, it might be wise to just yield to them.

best regards,
Florian Pflug
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 22, 2011 at 7:16 PM, Jan Urbański wrote:
> On 23/07/11 01:12, Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams wrote:
>>> On another matter, should the JSON type guard against duplicate member
>>> keys? The JSON RFC says "The names within an object SHOULD be
>>> unique," meaning JSON with duplicate members can be considered valid.
>>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>>> the same behavior: discard the first member in favor of the second.
>>> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
>>> error if a duplicate attribute is present (e.g.
>>> '<a href="b" href="c"/>'::xml).
>>
>> Hmm. That's tricky. I lean mildly toward throwing an error as being
>> more consistent with the general PG philosophy.
>
> OTOH:
>
> regression=# select 'key=>1,key=>2'::hstore;
>    hstore
> ------------
>  "key"=>"1"
> (1 row)

Fair point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On 23/07/11 01:12, Robert Haas wrote:
> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams wrote:
>> On another matter, should the JSON type guard against duplicate member
>> keys? The JSON RFC says "The names within an object SHOULD be
>> unique," meaning JSON with duplicate members can be considered valid.
>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>> the same behavior: discard the first member in favor of the second.
>> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
>> error if a duplicate attribute is present (e.g.
>> '<a href="b" href="c"/>'::xml).
>
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

OTOH:

regression=# select 'key=>1,key=>2'::hstore;
   hstore
------------
 "key"=>"1"
(1 row)

Cheers,
Jan
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams wrote:
> On another matter, should the JSON type guard against duplicate member
> keys? The JSON RFC says "The names within an object SHOULD be
> unique," meaning JSON with duplicate members can be considered valid.
> JavaScript interpreters (the ones I tried), PHP, and Python all have
> the same behavior: discard the first member in favor of the second.
> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
> error if a duplicate attribute is present (e.g.
> '<a href="b" href="c"/>'::xml).

Hmm. That's tricky. I lean mildly toward throwing an error as being more consistent with the general PG philosophy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
I think I've decided to only allow escapes of non-ASCII characters when the database encoding is UTF8. For example, $$"\u2013"$$::json will fail if the database encoding is WIN1252, even though WIN1252 can encode U+2013 (EN DASH). This may be somewhat draconian, given that:

* SQL_ASCII can otherwise handle "any" language according to the documentation.
* The XML type doesn't have this restriction (it just stores the input text verbatim, and converts it to UTF-8 before doing anything complicated with it).

However, it's simple to implement and understand. The JSON data type will not perform any automatic conversion between character encodings. Also, if we want to handle this any better in the future, we won't have to support legacy data containing a mixture of encodings.

In the future, we could create functions to compensate for the issues people encounter; for example:

* json_escape_unicode(json [, replace bool]) returns text -- convert non-ASCII characters to escapes. Optionally, use \uFFFD for unconvertible characters.
* json_unescape_unicode(text [, replace text]) returns json -- like json_in, but convert Unicode escapes to characters when possible. Optionally, replace unconvertible characters with a given string.

I've been going back and forth on how to handle encodings in the JSON type for a while, but suggestions and objections are still welcome. However, I plan to proceed in this direction so progress can be made.

On another matter, should the JSON type guard against duplicate member keys? The JSON RFC says "The names within an object SHOULD be unique," meaning JSON with duplicate members can be considered valid. JavaScript interpreters (the ones I tried), PHP, and Python all have the same behavior: discard the first member in favor of the second. That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an error if a duplicate attributeate is present (e.g. '<a href="b" href="c"/>'::xml).

Thanks for the input,
- Joey
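The discard-the-first behavior described above, and the error-throwing alternative discussed in the replies, can both be sketched in Python (the no_dups hook is a hypothetical illustration, not an existing API):

```python
import json

# Python, like the JavaScript engines mentioned, keeps the later member:
assert json.loads('{"key":1,"key":2}') == {"key": 2}

# Throwing an error instead can be prototyped with object_pairs_hook:
def no_dups(pairs):
    obj = dict(pairs)
    if len(obj) != len(pairs):
        raise ValueError("duplicate object member key")
    return obj

assert json.loads('{"a":1,"b":2}', object_pairs_hook=no_dups) == {"a": 1, "b": 2}
try:
    json.loads('{"key":1,"key":2}', object_pairs_hook=no_dups)
except ValueError:
    pass  # rejected, mirroring the XML type's duplicate-attribute error
```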
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Wed, Jul 20, 2011 at 6:49 AM, Florian Pflug wrote:
> Hm, I agree that we need to handle \uXXXX escapes in JSON input.
> We won't ever produce them during output though, right?

We could, to prevent transcoding errors if the client encoding is different than the server encoding (and neither is SQL_ASCII, nor is the client encoding UTF8). For example, if the database encoding is UTF-8 and the client encoding is WIN1252, I'd think it would be a good idea to escape non-ASCII characters.

> How does that XML type handle the situation? It seems that it'd have
> the same problem with unicode entity references "&#x266B;".

From the looks of it, XML operations convert the text to UTF-8 before passing it to libxml. The XML type does not normalize the input; SELECT '<x>&#x266B;♫</x>'::xml; simply yields <x>&#x266B;♫</x>. Escapes of any character are allowed in any encoding, from the looks of it.

- Joey
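Escaping non-ASCII characters on output, as suggested above, is exactly what Python's json does by default (ensure_ascii); a quick check:

```python
import json

# With ensure_ascii=True every non-ASCII character becomes a \uXXXX
# escape, so the output is safe to ship to a client in any encoding:
assert json.dumps('\u266b', ensure_ascii=True) == '"\\u266b"'

# With ensure_ascii=False the raw character is emitted, and the client
# encoding then has to be able to represent it:
assert json.dumps('\u266b', ensure_ascii=False) == '"\u266b"'
```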
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul20, 2011, at 06:40 , Joey Adams wrote:
> On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas wrote:
>>> Thanks for the input. I'm leaning in this direction too. However, it
>>> will be a tad tricky to implement the conversions efficiently, ...
>>
>> I'm a bit confused, because I thought what I was talking about was not
>> doing any conversions in the first place.
>
> We want to be able to handle \uXXXX escapes when the database encoding
> is not UTF-8. We could leave them in place, but sooner or later
> they'll need to be converted in order to unwrap or compare JSON
> strings.

Hm, I agree that we need to handle \uXXXX escapes in JSON input. We won't ever produce them during output though, right?

> The approach being discussed is converting escapes to the database
> encoding. This means escapes of characters not available in the
> database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.
>
> The PostgreSQL parser (which also supports Unicode escapes) takes a
> simpler approach: don't allow non-ASCII escapes unless the server
> encoding is UTF-8.

Maybe JSON should simply reject them too, then. At least for now.

How does that XML type handle the situation? It seems that it'd have the same problem with unicode entity references "&#x266B;".

best regards,
Florian Pflug
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas wrote:
>> Thanks for the input. I'm leaning in this direction too. However, it
>> will be a tad tricky to implement the conversions efficiently, ...
>
> I'm a bit confused, because I thought what I was talking about was not
> doing any conversions in the first place.

We want to be able to handle \uXXXX escapes when the database encoding is not UTF-8. We could leave them in place, but sooner or later they'll need to be converted in order to unwrap or compare JSON strings.

The approach being discussed is converting escapes to the database encoding. This means escapes of characters not available in the database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.

The PostgreSQL parser (which also supports Unicode escapes) takes a simpler approach: don't allow non-ASCII escapes unless the server encoding is UTF-8.

- Joey
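The forbidden-escape rule can be illustrated as follows (unescape_to is a hypothetical helper; the server would use its own conversion procs rather than Python codecs):

```python
def unescape_to(encoding, hex4):
    """Convert a \\uXXXX escape to the given server encoding, rejecting
    characters that encoding cannot represent."""
    ch = chr(int(hex4, 16))
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        raise ValueError(f"\\u{hex4} not representable in {encoding}")

assert unescape_to('iso-8859-1', '00e9') == b'\xe9'  # é fits in LATIN1
try:
    unescape_to('iso-8859-1', '266b')                # ♫ does not
except ValueError:
    pass  # forbidden, per the approach described above
```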
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Tue, Jul 19, 2011 at 9:03 PM, Joey Adams wrote:
> On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug wrote:
>> On Jul19, 2011, at 00:17 , Joey Adams wrote:
>>> I suppose a simple solution would be to convert all escapes and
>>> outright ban escapes of characters not in the database encoding.
>>
>> +1. Making JSON work like TEXT when it comes to encoding issues
>> makes this all much simpler conceptually. It also avoids all kinds
>> of weird issues if you extract textual values from a JSON document
>> server-side.
>
> Thanks for the input. I'm leaning in this direction too. However, it
> will be a tad tricky to implement the conversions efficiently, ...

I'm a bit confused, because I thought what I was talking about was not doing any conversions in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Bruce Momjian wrote: > Joey Adams wrote: > > Forwarding because the mailing list rejected the original message. > > Yes, I am seeing email failures to the 'core' email list. Marc says it is now fixed. --- > > > > > -- Forwarded message -- > > From: Joey Adams > > Date: Tue, Jul 19, 2011 at 11:23 PM > > Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS] > > Another swing at JSON > > To: Alvaro Herrera > > Cc: Florian Pflug , Tom Lane , Robert > > Haas , Bernd Helmle , > > Dimitri Fontaine , David Fetter > > , Josh Berkus , Pg Hackers > > > > > > > > On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera > > wrote: > > > Would it work to have a separate entry point into mbutils.c that lets > > > you cache the conversion proc caller-side? > > > > That sounds like a really good idea. ?There's still the overhead of > > calling the proc, but I imagine it's a lot less than looking it up. > > > > > I think the main problem is > > > determining the byte length of each source character beforehand. > > > > I'm not sure what you mean. ?The idea is to convert the \u escape > > to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8 > > sequence is easy to compute), call the conversion proc to get the > > null-terminated database-encoded character, then append the result to > > whatever StringInfo the string is going into. > > > > The only question mark is how big the destination buffer will need to > > be. ?The maximum number of bytes per char in any supported encoding is > > 4, but is it possible for one Unicode character to turn into multiple > > "character"s in the database encoding? > > > > While we're at it, should we provide the same capability to the SQL > > parser? ?Namely, the ability to use \u escapes above U+007F when > > the server encoding is not UTF-8? 
> > - Joey -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Joey Adams wrote: > Forwarding because the mailing list rejected the original message. Yes, I am seeing email failures to the 'core' email list. --- > > -- Forwarded message -- > From: Joey Adams > Date: Tue, Jul 19, 2011 at 11:23 PM > Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS] > Another swing at JSON > To: Alvaro Herrera > Cc: Florian Pflug , Tom Lane , Robert > Haas , Bernd Helmle , > Dimitri Fontaine , David Fetter > , Josh Berkus , Pg Hackers > > > > On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera > wrote: > > Would it work to have a separate entry point into mbutils.c that lets > > you cache the conversion proc caller-side? > > That sounds like a really good idea. There's still the overhead of > calling the proc, but I imagine it's a lot less than looking it up. > > > I think the main problem is > > determining the byte length of each source character beforehand. > > I'm not sure what you mean. The idea is to convert the \u escape > to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8 > sequence is easy to compute), call the conversion proc to get the > null-terminated database-encoded character, then append the result to > whatever StringInfo the string is going into. > > The only question mark is how big the destination buffer will need to > be. The maximum number of bytes per char in any supported encoding is > 4, but is it possible for one Unicode character to turn into multiple > "character"s in the database encoding? > > While we're at it, should we provide the same capability to the SQL > parser? Namely, the ability to use \u escapes above U+007F when > the server encoding is not UTF-8? > > - Joey -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. 
+
Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Forwarding because the mailing list rejected the original message. -- Forwarded message -- From: Joey Adams Date: Tue, Jul 19, 2011 at 11:23 PM Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON To: Alvaro Herrera Cc: Florian Pflug , Tom Lane , Robert Haas , Bernd Helmle , Dimitri Fontaine , David Fetter , Josh Berkus , Pg Hackers On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera wrote: > Would it work to have a separate entry point into mbutils.c that lets > you cache the conversion proc caller-side? That sounds like a really good idea. There's still the overhead of calling the proc, but I imagine it's a lot less than looking it up. > I think the main problem is > determining the byte length of each source character beforehand. I'm not sure what you mean. The idea is to convert the \u escape to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8 sequence is easy to compute), call the conversion proc to get the null-terminated database-encoded character, then append the result to whatever StringInfo the string is going into. The only question mark is how big the destination buffer will need to be. The maximum number of bytes per char in any supported encoding is 4, but is it possible for one Unicode character to turn into multiple "character"s in the database encoding? While we're at it, should we provide the same capability to the SQL parser? Namely, the ability to use \u escapes above U+007F when the server encoding is not UTF-8? - Joey
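The scheme Joey sketches leans on the fact that a codepoint's UTF-8 length is fully determined by its range, so the buffer for the escape-to-UTF-8 step can be sized up front. As a standalone illustration of that encoding step (this is my own sketch of the classic algorithm, not PostgreSQL's actual unicode_to_utf8() from src/backend/utils/mb/wchar.c):

```c
#include <assert.h>

/*
 * Encode one Unicode codepoint (up to U+10FFFF) as UTF-8 and return
 * the byte count.  The count depends only on the codepoint's range,
 * which is why a caller converting a \uXXXX escape can compute the
 * resulting sequence length before writing anything.
 */
static int
encode_utf8(unsigned int cp, unsigned char *out)
{
	if (cp <= 0x7F)				/* ASCII: 1 byte */
	{
		out[0] = (unsigned char) cp;
		return 1;
	}
	if (cp <= 0x7FF)			/* 2 bytes */
	{
		out[0] = (unsigned char) (0xC0 | (cp >> 6));
		out[1] = (unsigned char) (0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp <= 0xFFFF)			/* 3 bytes */
	{
		out[0] = (unsigned char) (0xE0 | (cp >> 12));
		out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char) (0x80 | (cp & 0x3F));
		return 3;
	}
	/* 4 bytes -- the maximum, matching the "4 bytes per char" bound above */
	out[0] = (unsigned char) (0xF0 | (cp >> 18));
	out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char) (0x80 | (cp & 0x3F));
	return 4;
}
```

Note the 4-byte ceiling here is the same bound Joey cites for sizing the destination buffer before calling the conversion proc.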
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Excerpts from Joey Adams's message of Tue Jul 19 21:03:15 -0400 2011: > On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug wrote: > > On Jul19, 2011, at 00:17 , Joey Adams wrote: > >> I suppose a simple solution would be to convert all escapes and > >> outright ban escapes of characters not in the database encoding. > > > > +1. Making JSON work like TEXT when it comes to encoding issues > > makes this all much simpler conceptually. It also avoids all kinds > > of weird issues if you extract textual values from a JSON document > > server-side. > > Thanks for the input. I'm leaning in this direction too. However, it > will be a tad tricky to implement the conversions efficiently, since > the wchar API doesn't provide a fast path for individual codepoint > conversion (that I'm aware of), and pg_do_encoding_conversion doesn't > look like a good thing to call lots of times. > > My plan is to scan for escapes of non-ASCII characters, convert them > to UTF-8, and put them in a comma-delimited string like this: > > a,b,c,d, > > then, convert the resulting string to the server encoding (which may > fail, indicating that some codepoint(s) are not present in the > database encoding). After that, read the string and plop the > characters where they go. Ugh. > It's "clever", but I can't think of a better way to do it with the existing > API. Would it work to have a separate entry point into mbutils.c that lets you cache the conversion proc caller-side? I think the main problem is determining the byte length of each source character beforehand. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug wrote: > On Jul19, 2011, at 00:17 , Joey Adams wrote: >> I suppose a simple solution would be to convert all escapes and >> outright ban escapes of characters not in the database encoding. > > +1. Making JSON work like TEXT when it comes to encoding issues > makes this all much simpler conceptually. It also avoids all kinds > of weird issues if you extract textual values from a JSON document > server-side. Thanks for the input. I'm leaning in this direction too. However, it will be a tad tricky to implement the conversions efficiently, since the wchar API doesn't provide a fast path for individual codepoint conversion (that I'm aware of), and pg_do_encoding_conversion doesn't look like a good thing to call lots of times. My plan is to scan for escapes of non-ASCII characters, convert them to UTF-8, and put them in a comma-delimited string like this: a,b,c,d, then, convert the resulting string to the server encoding (which may fail, indicating that some codepoint(s) are not present in the database encoding). After that, read the string and plop the characters where they go. It's "clever", but I can't think of a better way to do it with the existing API. - Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Jul19, 2011, at 00:17 , Joey Adams wrote: > I suppose a simple solution would be to convert all escapes and > outright ban escapes of characters not in the database encoding. +1. Making JSON work like TEXT when it comes to encoding issues makes this all much simpler conceptually. It also avoids all kinds of weird issues if you extract textual values from a JSON document server-side. If we really need more flexibility than that, we should look at ways to allow different columns to have different encodings. Doing that just for JSON seems wrong - especially because it doesn't really reduce the complexity of the problem, as your example shows. The essential problem here is, AFAICS, that there's really no sane way to compare strings in two different encodings, unless both encode a subset of Unicode only. > This would have the nice property that all strings can be unescaped > server-side. Problem is, what if a browser or other program produces, > say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database > where the encoding lacks this character? They'll get an error - just as if they had tried to store that same character in a TEXT column. > On the other hand, converting all JSON to UTF-8 would be simpler to > implement. It would probably be more intuitive, too, given that the > JSON RFC says, "JSON text SHALL be encoded in Unicode." Yet, the only reason I'm aware of for some people to not use UTF-8 as the server encoding is that it's pretty inefficient storage-wise for some scripts (AFAIR some Japanese scripts are an example, but I don't remember the details). By making JSON store UTF-8 on-disk always, the JSON type gets less appealing to those people. best regards, Florian Pflug
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane wrote: > BTW, could the \u problem be finessed by leaving such escapes in > source form? Yes, it could. However, it doesn't solve the problem of comparison (needed for member lookup), which requires canonicalizing the strings to be compared. Here's a Unicode handling policy I came up with. It is guaranteed to be lossless as long as the client and database encodings are the same. --- On input (json_in), if the text is valid JSON, it is condensed: * Whitespace characters surrounding tokens are removed. * Long escapes (like \u0022) are converted to short escapes (like \") where possible. * Unnecessary escapes of ASCII characters (e.g. \u0061 and even \u007F) are converted to their respective characters. * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E) are converted to their respective characters, but only if the database encoding is UTF-8. On output (json_out), non-ASCII characters are converted to \u escapes, unless one or more of these very likely circumstances hold: * The client encoding and database encoding are the same. No conversion is performed, so escaping characters will not prevent any conversion errors. * The client encoding is UTF-8. Escaping is not necessary because the client can encode all Unicode codepoints. * The client encoding and/or database encoding is SQL_ASCII. SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed. When a JSON-encoded string is unwrapped using from_json (e.g. from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the characters they represent. If any escapes cannot be represented in the database encoding, an error will be raised. Note that: * If the database encoding is UTF-8, conversion will never fail. * If the database encoding is SQL_ASCII, conversion will fail if any escapes of non-ASCII characters are present. --- However, I'm having a really hard time figuring out how comparison would work in this framework. Here are a few options: 1. 
Convert the strings to UTF-8, convert the escapes to characters, and compare the strings. 2. Convert the escapes to the database encoding, then compare the strings. 3. If either string contains escapes of non-ASCII characters, do 1. Otherwise, do 2. Number 1 seems the most sane to me, but could lead to rare errors. Number 3 could produce confusing results. If character set X has three different representations of one Unicode codepoint, then we could have scenarios like this (simplified): "abc♫" != "aaa♫" but: "abc\u266b" == "aaa♫" I suppose a simple solution would be to convert all escapes and outright ban escapes of characters not in the database encoding. This would have the nice property that all strings can be unescaped server-side. Problem is, what if a browser or other program produces, say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database where the encoding lacks this character? On the other hand, converting all JSON to UTF-8 would be simpler to implement. It would probably be more intuitive, too, given that the JSON RFC says, "JSON text SHALL be encoded in Unicode." In any case, the documentation should say "Use UTF-8 for best results", as there seems to be no entirely satisfactory way to handle JSON in other database encodings. - Joey
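The input policy above turns pairs like \uD834\uDD1E into single characters, which requires recognizing and combining UTF-16 surrogate halves. A hedged sketch of that arithmetic (function names are mine for illustration, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* True iff a 16-bit escape value is a high (leading) surrogate. */
static bool
is_high_surrogate(unsigned int u)
{
	return u >= 0xD800 && u <= 0xDBFF;
}

/* True iff a 16-bit escape value is a low (trailing) surrogate. */
static bool
is_low_surrogate(unsigned int u)
{
	return u >= 0xDC00 && u <= 0xDFFF;
}

/*
 * Combine \uHHHH\uLLLL surrogate halves into a supplementary-plane
 * codepoint.  A validating parser would reject the input instead of
 * calling this when the halves are not a proper high+low pair -- the
 * "invalid surrogate pairs" case (e.g. "\uD800\uD800") discussed
 * elsewhere in this thread.
 */
static unsigned int
combine_surrogates(unsigned int hi, unsigned int lo)
{
	return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
}
```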
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams >> wrote: >>> I'm having a really hard time figuring out how the JSON module should >>> handle non-Unicode character sets. > >> But, again, why not just forget about transcoding and define it as >> "JSON, if you happen to be using utf-8 as the server encoding, and >> otherwise some variant of JSON that uses the server encoding as its >> native format?". It seems to me that that would be a heck of a lot >> simpler and more reliable, and I'm not sure it's any less useful in >> practice. > > Right offhand, the only argument I can see against this is that we might > someday want to pass JSON datums directly to third-party loadable > libraries that are picky about UTF8-ness. But we could just document > and/or enforce that such libraries can only be used in UTF8-encoded > databases. Right. Or, if someone someday implements a feature to allow multiple server encodings within the same database, then we can tell people to use it if they want JSON to work in a spec-canonical way. > BTW, could the \u problem be finessed by leaving such escapes in > source form? Joey seems to want to canonicalize the input, which argues against that, and to detect invalid surrogate pairs, which likewise argues against that. You could argue that's overkill, but it seems to have at least some merit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Robert Haas writes: > On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams > wrote: >> I'm having a really hard time figuring out how the JSON module should >> handle non-Unicode character sets. > But, again, why not just forget about transcoding and define it as > "JSON, if you happen to be using utf-8 as the server encoding, and > otherwise some variant of JSON that uses the server encoding as its > native format?". It seems to me that that would be a heck of a lot > simpler and more reliable, and I'm not sure it's any less useful in > practice. Right offhand, the only argument I can see against this is that we might someday want to pass JSON datums directly to third-party loadable libraries that are picky about UTF8-ness. But we could just document and/or enforce that such libraries can only be used in UTF8-encoded databases. BTW, could the \u problem be finessed by leaving such escapes in source form? regards, tom lane
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams wrote: > On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams > wrote: >> I'll try to submit a revised patch within the next couple days. > > Sorry this is later than I said. > > I addressed the issues covered in the review. I also fixed a bug > where "\u0022" would become """, which is invalid JSON, causing an > assertion failure. > > However, I want to put this back into WIP for a number of reasons: > > * The current code accepts invalid surrogate pairs (e.g. > "\uD800\uD800"). The problem with accepting them is that it would be > inconsistent with PostgreSQL's Unicode support, and with the Unicode > standard itself. In my opinion: as long as the server encoding is > universal (i.e. UTF-8), decoding a JSON-encoded string should not fail > (barring data corruption and resource limitations). > > * I'd like to go ahead with the parser rewrite I mentioned earlier. > The new parser will be able to construct a parse tree when needed, and > it won't use those overkill parsing macros. > > * I recently learned that not all supported server encodings can be > converted to Unicode losslessly. The current code, on output, > converts non-ASCII characters to Unicode escapes under some > circumstances (see the comment above json_need_to_escape_unicode). > > I'm having a really hard time figuring out how the JSON module should > handle non-Unicode character sets. \u escapes in JSON literals > can be used to encode characters not available in the server encoding. > On the other hand, the server encoding can encode characters not > present in Unicode (see the third bullet point above). This means > JSON normalization and comparison (along with member lookup) are not > possible in general. I previously suggested that, instead of trying to implement JSON, you should just try to implement JSON-without-the-restriction-that-everything-must-be-UTF8. 
Most people are going to be using UTF-8 simply because it's the default, and if you forget about transcoding then ISTM that this all becomes a lot simpler. We don't, in general, have the ability to support data in multiple encodings inside PostgreSQL, and it seems to me that by trying to invent a mechanism for making that work as part of this patch, you are setting the bar for yourself awfully high. One thing to think about here is that transcoding between UTF-8 and the server encoding seems like the wrong thing all around. After all, the user does not want the data in the server encoding; they want it in their chosen client encoding. If you are transcoding between UTF-8 and the server encoding, then that suggests that there's some double-transcoding going on here, which creates additional opportunities for (1) inefficiency and (2) outright failure. I'm guessing that's because you're dealing with an interface that expects the internal representation of the datum on one side and the server encoding on the other side, which gets back to the point in the preceding paragraph. You'd probably need to revise that interface in order to make this really work the way it should, and that might be more than you want to get into. At any rate, it probably is a separate project from making JSON work. If in spite of the above you're bent on continuing down your present course, then it seems to me that you'd better make the on-disk representation UTF-8, with all \u escapes converted to the corresponding characters. If you hit an invalid surrogate pair, or a character that exists in the server encoding but not UTF-8, it's not a legal JSON object and you throw an error on input, just as you would for mismatched braces or similar. On output, you should probably just use \u to represent any unrepresentable characters - i.e. option 3 from your original list. That may be slow, but I think that it's not really worth devoting a lot of mental energy to this case. 
Most people are going to be using UTF-8 because that's the default, and those who are not shouldn't expect a data format built around UTF-8 to work perfectly in their environment, especially if they insist on using characters that are representable in only some of the encodings they are using. But, again, why not just forget about transcoding and define it as "JSON, if you happen to be using utf-8 as the server encoding, and otherwise some variant of JSON that uses the server encoding as its native format?". It seems to me that that would be a heck of a lot simpler and more reliable, and I'm not sure it's any less useful in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams wrote: > I'll try to submit a revised patch within the next couple days. Sorry this is later than I said. I addressed the issues covered in the review. I also fixed a bug where "\u0022" would become """, which is invalid JSON, causing an assertion failure. However, I want to put this back into WIP for a number of reasons: * The current code accepts invalid surrogate pairs (e.g. "\uD800\uD800"). The problem with accepting them is that it would be inconsistent with PostgreSQL's Unicode support, and with the Unicode standard itself. In my opinion: as long as the server encoding is universal (i.e. UTF-8), decoding a JSON-encoded string should not fail (barring data corruption and resource limitations). * I'd like to go ahead with the parser rewrite I mentioned earlier. The new parser will be able to construct a parse tree when needed, and it won't use those overkill parsing macros. * I recently learned that not all supported server encodings can be converted to Unicode losslessly. The current code, on output, converts non-ASCII characters to Unicode escapes under some circumstances (see the comment above json_need_to_escape_unicode). I'm having a really hard time figuring out how the JSON module should handle non-Unicode character sets. \u escapes in JSON literals can be used to encode characters not available in the server encoding. On the other hand, the server encoding can encode characters not present in Unicode (see the third bullet point above). This means JSON normalization and comparison (along with member lookup) are not possible in general. Even if I assume server -> UTF-8 -> server transcoding is lossless, the situation is still ugly. Normalization could be implemented a few ways: * Convert from server encoding to UTF-8, and convert \u escapes to UTF-8 characters. This is space-efficient, but the resulting text would not be compatible with the server encoding (which may or may not matter). 
* Convert from server encoding to UTF-8, and convert all non-ASCII characters to \u escapes, resulting in pure ASCII. This bloats the text by a factor of three, in the worst case. * Convert \u escapes to characters in the server encoding, but only where possible. This would be extremely inefficient. The parse tree (for functions that need it) will need to store JSON member names and strings either in UTF-8 or in normalized JSON (which could be the same thing). Help and advice would be appreciated. Thanks! - Joey [Attachment: json-contrib-rev1-20110714.patch.gz]
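The factor-of-three figure follows from the escape arithmetic: a BMP character that takes two bytes in UTF-8 becomes the six ASCII bytes \uXXXX once escaped. A sketch of that output-side step for a single BMP codepoint (a hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Escape one BMP codepoint (<= U+FFFF) as \uXXXX into buf, which must
 * hold at least 7 bytes (6 chars + NUL).  Returns the number of
 * characters written.  A 2-byte UTF-8 character thus becomes 6 ASCII
 * bytes -- the worst-case 3x bloat mentioned above.  A codepoint
 * beyond the BMP would need a surrogate pair, \uXXXX\uXXXX, i.e. 12
 * bytes for a 4-byte UTF-8 sequence (only 3x again).
 */
static int
escape_bmp_codepoint(unsigned int cp, char *buf)
{
	return snprintf(buf, 7, "\\u%04X", cp);
}
```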
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On 7/4/11 7:22 PM, Joseph Adams wrote: > I'll try to submit a revised patch within the next couple days. So? New patch? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Thanks for reviewing my patch! On Mon, Jul 4, 2011 at 7:10 AM, Bernd Helmle wrote: > +comment = 'data type for storing and manipulating JSON content' > > I'm not sure, if "manipulating" is a correct description. Maybe i missed it, > but i didn't see functions to manipulate JSON strings directly, for example, > adding an object to a JSON string, ... Indeed. I intend to add manipulation functions in the future. The words "and manipulating" shouldn't be there yet. > -- Coding > ... > ... It is noticable that the parser seems to define > its own is_space() and is_digit() functions, e.g.: > > +#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' > ') > > Wouldn't it be more clean to use isspace() instead? isspace() accepts '\f', '\v', and sometimes other characters as well, depending on locale. Likewise, isdigit() accepts some non-ASCII characters in addition to [0-9], depending on the locale and platform. Thus, to avoid parsing a superset of the JSON spec, I can't use the functions. I should add a comment with this explanation. > This code block in function json_escape_unicode() looks suspicious: > > + /* Convert to UTF-8, if necessary. */ > + { > + const char *orig = json; > + json = (const char *) > + pg_do_encoding_conversion((unsigned char *) json, length, > + GetDatabaseEncoding(), PG_UTF8); > + if (json != orig) > + length = strlen(json); > + } > > Seems it *always* wants to convert the string. Isn't there a "if" condition > missing? pg_do_encoding_conversion does this check already. Perhaps I should rephrase the comment slightly: + /* Convert to UTF-8 (only if necessary). */ > There's a commented code fragment missing, which should be removed (see last > two lines of the snippet below): > > +static unsigned int > +read_hex16(const char *in) > ... 
> + Assert(is_hex_digit(c)); > + > + if (c >= '0' && c <= '9') > + tmp = c - '0'; > + else if (c >= 'A' && c <= 'F') > + tmp = c - 'A' + 10; > + else /* if (c >= 'a' && c <= 'f') */ > + tmp = c - 'a' + 10; That comment is there for documentation purposes, to indicate the range check that's skipped because we know c is a hex digit, and [a-f] is the only thing it could be (and must be) at that line. Should it still be removed? > json.c introduces new appendStringInfo* functions, e.g. > appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better > to name them to something different, since it may occur someday that the > backend > will introduce the same notion with a different meaning. They are static, > but why not naming them into something like jsonAppendStringInfoUtf8() or > similar? I agree. > -- Regression Tests ... > It also tests UNICODE sequences and input encoding with other server > encoding than UTF-8. It tests with other *client* encodings than UTF-8, but not other server encodings than UTF-8. Is it possible to write tests for different server encodings? -- Self-review There's a minor bug in the normalization code: > select $$ "\u0009" $$::json; json -- "\u0009" (1 row) This should produce "\t" instead. I'm thinking that my expect_*/next_*/pop_char/... parsing framework is overkill, and that, for a syntax as simple as JSON's, visibly messy parsing code like this: if (s < e && (*s == 'E' || *s == 'e')) { s++; if (s < e && (*s == '+' || *s == '-')) s++; if (!(s < e && is_digit(*s))) return false; do s++; while (s < e && is_digit(*s)); } would be easier to maintain than the deceptively elegant-looking code currently in place: if (optional_char_cond(s, e, *s == 'E' || *s == 'e')) { optional_char_cond(s, e, *s == '+' || *s == '-'); skip1_pred(s, e, is_digit); } I don't plan on gutting the current macro-driven parser just yet. 
When I do, the new parser will support building a parse tree (only when needed), and the parsing functions will have signatures like this: static bool parse_value(const char **sp, const char *e, JsonNode **out); static bool parse_string(const char **sp, const char *e, StringInfo *out); ... The current patch doesn't provide manipulation features where a parse tree would come in handy. I'd rather hold off on rewriting the parser until such features are added. I'll try to submit a revised patch within the next couple days. Thanks! - Joey
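The explicit scanning style Joey favors for the exponent part extends naturally to the whole JSON number grammar. A hedged sketch of such a validator (my own code written in that style, not the patch's), covering the RFC 4627 production -?(0|[1-9][0-9]*)(.[0-9]+)?([Ee][+-]?[0-9]+)?:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool
is_digit(char c)
{
	return c >= '0' && c <= '9';
}

/*
 * Validate [s, e) as a JSON number, using the explicit
 * bounds-checked scanning style discussed above.
 */
static bool
json_number_valid(const char *s, const char *e)
{
	if (s < e && *s == '-')
		s++;
	/* integer part: "0" alone, or a nonzero digit followed by digits */
	if (!(s < e && is_digit(*s)))
		return false;
	if (*s == '0')
		s++;
	else
		do s++; while (s < e && is_digit(*s));
	/* optional fraction: '.' followed by one or more digits */
	if (s < e && *s == '.')
	{
		s++;
		if (!(s < e && is_digit(*s)))
			return false;
		do s++; while (s < e && is_digit(*s));
	}
	/* optional exponent -- the fragment shown in the message above */
	if (s < e && (*s == 'E' || *s == 'e'))
	{
		s++;
		if (s < e && (*s == '+' || *s == '-'))
			s++;
		if (!(s < e && is_digit(*s)))
			return false;
		do s++; while (s < e && is_digit(*s));
	}
	return s == e;				/* no trailing garbage */
}

/* Convenience wrapper for NUL-terminated test strings. */
static bool
valid_cstr(const char *s)
{
	return json_number_valid(s, s + strlen(s));
}
```

Note how the leading-zero rule ("01" is invalid) and the mandatory digit after '.' and after the exponent marker each fall out as one explicit check, which is the maintainability argument being made.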
Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
--On 18 June 2011 12:29:38 +0200 Bernd Helmle wrote: Similar problems occur with a couple other modules I tried (hstore, intarray). Hmm, works for me. Seems you have messed up your installation in some way (build against current -HEAD but running against a 9.1?). I'm going to review in the next couple of days again. A bit later than expected, but here is my summary on the patch: -- Patch Status The patch applies cleanly. Since it's primarily targeted as an addition to contrib/, it doesn't touch the backend source tree (besides documentation TOC entries), but adds a new subdirectory in contrib/json. The patch is in context diff as requested. -- Functionality The patch as is provides a basic implementation for a JSON datatype. It includes normalization and validation of JSON strings. It adds the datatype implementation as an extension. The implementation focuses on the datatype functionality only; there is no additional support for special operators. Thus, I think the comments in the control file (and docs) promise more than the contrib module will actually deliver: +comment = 'data type for storing and manipulating JSON content' I'm not sure if "manipulating" is a correct description. Maybe I missed it, but I didn't see functions to manipulate JSON strings directly, for example, adding an object to a JSON string, ... -- Coding The JSON datatype is implemented as a variable-length character type, and thus allows very large JSON strings to be stored. It transparently decides when to escape Unicode code points depending on the selected client- and server-encoding. Validation is done through its own JSON parser. The parser itself is a recursive descent parser. It is noticeable that the parser seems to define its own is_space() and is_digit() functions, e.g.: +#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ') Wouldn't it be cleaner to use isspace() instead? 
This code block in function json_escape_unicode() looks suspicious: + /* Convert to UTF-8, if necessary. */ + { + const char *orig = json; + json = (const char *) + pg_do_encoding_conversion((unsigned char *) json, length, + GetDatabaseEncoding(), PG_UTF8); + if (json != orig) + length = strlen(json); + } Seems it *always* wants to convert the string. Isn't there an "if" condition missing? There's a commented code fragment missing, which should be removed (see last two lines of the snippet below): +static unsigned int +read_hex16(const char *in) +{ + unsigned int i; + unsigned int tmp; + unsigned int ret = 0; + + for (i = 0; i < 4; i++) + { + char c = *in++; + + Assert(is_hex_digit(c)); + + if (c >= '0' && c <= '9') + tmp = c - '0'; + else if (c >= 'A' && c <= 'F') + tmp = c - 'A' + 10; + else /* if (c >= 'a' && c <= 'f') */ + tmp = c - 'a' + 10; json.c introduces new appendStringInfo* functions, e.g. appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better to name them to something different, since it may occur someday that the backend will introduce the same notion with a different meaning. They are static, but why not naming them into something like jsonAppendStringInfoUtf8() or similar? -- Regression Tests The patch includes a fairly complete regression test suite, which covers much of the formatting, validating and input functionality of the JSON datatype. It also tests UNICODE sequences and input encoding with other server encoding than UTF-8. -- Thanks Bernd
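For reference, the truncated read_hex16() snippet quoted in the review can be completed into a standalone, testable form. The loop tail (accumulating into ret) and the return are my reconstruction of the obvious intent, not the patch's exact code:

```c
#include <assert.h>

/*
 * Parse exactly four hex digits into their 16-bit value.  The patch
 * guards this with Assert(is_hex_digit(c)) -- the digits are assumed
 * already validated, which is why the final else branch needs no
 * range check (only [a-f] remains possible at that point).
 */
static unsigned int
read_hex16(const char *in)
{
	unsigned int i;
	unsigned int tmp;
	unsigned int ret = 0;

	for (i = 0; i < 4; i++)
	{
		char c = *in++;

		if (c >= '0' && c <= '9')
			tmp = c - '0';
		else if (c >= 'A' && c <= 'F')
			tmp = c - 'A' + 10;
		else /* if (c >= 'a' && c <= 'f') */
			tmp = c - 'a' + 10;

		ret = (ret << 4) | tmp;	/* shift in 4 bits per digit */
	}
	return ret;
}
```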