Re: [HACKERS] JSON data type status?

Joseph Adams Fri, 21 Jan 2011 00:59:57 -0800

On Thu, Jan 20, 2011 at 8:13 PM, Itagaki Takahiro
<[email protected]> wrote:
> On Fri, Jan 21, 2011 at 09:11, Bruce Momjian <[email protected]> wrote:
>> What happened to our work to add a JSON data type for PG 9.1?
>
> Nothing will happen in 9.1.
> I assume we are in "competition" status:
>
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00481.php
>
> Also, if PGXN will work well, we might not have to include JSON
> in the core. We can download any JSON implementations from the
> site after installing the core server.  Of course, if we will
> use JSON types in the core (EXPLAIN JSON output?), we have to
> include one of them.


Thanks for the update.  However, I should probably take some blame as
well for spending basically zero time on the JSON datatype for lengths
at a time.

After someone emailed me asking how to install the JSON datatype, I
fixed up my module version of it:

http://git.postgresql.org/gitweb?p=json-datatype.git

I also have a version where JSON is a core datatype (the patches I
submitted to CommitFest), but it is obsolete now.  The module version
has all the same features, but also fixes a Unicode bug* and adds
PostgreSQL 8.4 compatibility.  For those who want the JSON datatype
right now, I recommend using the module.

I plan to maintain and improve the JSON module (repository linked
above), albeit at a snail's pace.  Patches are certainly welcome.  See
roadmap.markdown in the repository for some minor and not-so-minor
design decisions.


Joey Adams

* P.S.

The "Unicode bug" involved UTF-16 surrogate pair calculation.  JSON
encodes Unicode characters that aren't in the Basic Multilingual Plane
the same way UTF-16 does: using two 4-digit hex codes.  The correct
formula for determining the Unicode codepoint of a surrogate pair is:

    unicode = 0x10000 + (((uc & 0x3FF) << 10) | (lc & 0x3FF));

where:

    0xD800 <= uc <= 0xDBFF
    0xDC00 <= lc <= 0xDFFF

For example, consider the JSON string "\uD835\uDD0D":

    uc      = 0xD835
    lc      = 0xDD0D
    unicode = 0x1D50D

This pair of hex indices collapses into one Unicode codepoint, 0x1D50D
(or "𝔍").

I originally used this:

    /* WRONG */
    unicode = 0x10000 | ((uc & 0x3FF) << 10) | (lc & 0x3FF);

The problem is, ((uc & 0x3FF) << 10) | (lc & 0x3FF) has a range of 0
to 0xFFFFF, and in some of those numbers, `0x10000 |` adds zero.

I am sharing all this because I have made this mistake twice, and I
believe it is an easy mistake to make.

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] JSON data type status?

Reply via email to