from:"Terry Laurenzo"

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-20 Thread Terry Laurenzo

I've got a new stripped down version of the binary json plugin on github:
https://github.com/tlaurenzo/pgjson

With all due warning of contrived benchmarks, I wrote some tests to see
where things stand. The test script is here:
https://github.com/tlaurenzo/pgjson/blob/master/testdocs/runbenchmark.sh
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard
and Postgresql 9.0) are here and copied into this email at the end:

https://github.com/tlaurenzo/pgjson/blob/master/testdocs/benchmark-results-2010-11-20-mbp32.txt

And for some commentary...
I copied the 5 sample documents from json.org's example section for these
tests. These are loaded into a table with a varchar column 1000 times each
(so the test table has 5000 rows in it). In all situations, the binary
encoding was smaller than the normalized text form (between 9 and 23%
smaller). I think there are cases where the binary form will be larger than
the corresponding text form, but I don't think they would be very common.

For the timings, various operations are performed on the 5000 row test table
for 100 iterations. In all situations, the query returns the Length of the
transformed document instead of the document itself so as to factor out
variable client IO between the test runs. The Null Parse is essentially
just a select from the table and therefore represents the baseline. The
times varied a little bit between runs but did not change materially.

What we see from this is that parsing JSON text and generating a binary
representation is cheap, representing approximately 10% of the base case
time. Conversely, anything that involves generating JSON text is expensive,
accounting for 30-40% of the base case time. Some incidental profiling
shows that while the entire operation is expensive, the process of
generating string literals dominates this time. There is likely room for
optimization in this method, but it should be noted that most of these
documents are lightly escaped (if escaped at all) which represents the happy
path through the string literal output function.

While I have not profiled any advanced manipulation of the binary structure
within the server, it stands to reason that manipulating the binary
structure should be significantly faster than an approach that requires
(perhaps multiple) transcoding between text representations in order to
complete a sequence of operations.

Assuming that the JSON datatype (at a minimum) normalizes text for storage,
then the text storage option accounts for about the most expensive path but
with none of the benefits of an internal binary form (smaller size, ability
to cheaply perform non-trivial manipulation within the database server).

Of course, just having a JSON datatype that blindly stores text will beat
everything, but I'm getting closer to thinking that the binary option is
worth the tradeoff.

Comments?
Terry

Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
Document Name | Original Size | Binary Size | Normalized Size |
Percentage Savings
--+---+-+-+
jsonorg_sample1.json | 582 | 311 | 360 |
13.6
jsonorg_sample2.json | 241 | 146 | 183 |
20.2185792349727
jsonorg_sample3.json | 601 | 326 | 389 |
16.1953727506427
jsonorg_sample4.json | 3467 |2466 |2710 |
9.00369003690037
jsonorg_sample5.json | 872 | 469 | 613 |
23.4910277324633
(5 rows)

=== TEST PARSE AND SERIALIZATION ===
Null Parse:
12.12 real 2.51 user 0.13 sys
Parse to Binary:
13.38 real 2.51 user 0.14 sys
Serialize from Binary:
16.65 real 2.51 user 0.13 sys
Normalize Text:
18.99 real 2.51 user 0.13 sys
Roundtrip (parse to binary and serialize):
18.58 real 2.51 user 0.14 sys

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-09 Thread Terry Laurenzo

Robert,
I think I agree.  At a minimum, I would like to see the chosen of the
competing priorities live on as an outside module for use by previous
versions.  Even having proposed one, and soon to be two of the competing
implementations, it makes me nervous to commit to one at this juncture.

I'm wrapping some items up this week but expect to have some time over the
next two weeks to complete my implementation.

Here's a quick status on where I'm at:
   - Binary format has been implemented as specified here:
https://github.com/tlaurenzo/pgjson/blob/master/pgjson/jsonlib/BINARY-README.txt
   - Hand coded a JSON-text lexer/parser and JSON-binary parser and
transcoders
   - Ran out of time to do exhaustive tests, but typical payloads yield
significant space savings
   - Based on an admittedly small number of test cases, I identified that
the process of encoding a string literal is the most expensive operation in
the general case, accounting for 1/2 to 2/3 of the time spent in a
transcoding operation.  This is fairly obvious but good to know.  I did not
spend any time looking into this further.
   - Drastically simplified the code in preparation to build a stand-alone
module

As soon as I get a bit of time I was going to do the following:
   - Create a simple PGXS based build, stripping out the rest of the bits I
was doodling on
   - Re-implement the PG module based on the new jsonlib binary format and
parser
   - Add JSONPath and some encoding bits back in from the original patch
   - Do some holistic profiling between the JSON-as-Text approach and the
JSON-as-Binary approach

This is still a bit of a fishing expedition, imo and I would have a hard
time getting this ready for commit on Monday.  If getting something in right
now is critical, Joey's original patch is the most complete at this point.

Terry


On Tue, Nov 9, 2010 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Nov 8, 2010 at 9:52 PM, Itagaki Takahiro
 itagaki.takah...@gmail.com wrote:
  Are there any activities in JSON data types for the next commitfest?

 I'm leaning toward the view that we shouldn't commit a JSON
 implementation to core (or contrib) for 9.1.  We have at least three
 competing proposals on the table.  I thought of picking it up and
 hacking on it myself, but then we'd have four competing proposals on
 the table.  Even if we could come to some consensus on which of those
 proposals is technically superior, the rate at which new ideas are
 being proposed suggests to me that it would be premature to anoint any
 single implementation as our canonical one.  I'd like to see some of
 these patches finished and put up on pgfoundry or github, and then
 consider moving one of them into core when we have a clear and stable
 consensus that one of them is technically awesome and the best thing
 we're going to get.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-24 Thread Terry Laurenzo


 It doesn't do particularly well on my previous example of [1,2,3].  It
 comes out slightly shorter on [a,b,c] and better if the strings
 need any escaping.  I don't think the float4 and float8 formats are
 very useful; how could you be sure that the output was going to look
 the same as the input?  Or alternatively that dumping a particular
 object to text and reloading it will produce the same internal
 representation?  I think it would be simpler to represent integers
 using a string of digits; that way you can be assured of going from
 text - binary - text without change.

 Perhaps it would be enough to define the high two bits as follows: 00
 = array, 01 = object, 10 = string, 11 = number/true/false/null.  The
 next 2 bits specify how the length is stored.  00 = remaining 4 bits
 store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
 byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
 additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
 bytes store a full 32-bit length word.  Then, the array, object, and
 string representations can work as you've specified them.  Anything
 else can be represented by itself, or perhaps we should say that
 numbers represent themselves and true/false/null are represented by a
 1-byte sequence, t/f/n (or perhaps we could define 00{00,01,10} to
 mean those values, since there's no obvious reason for the low bits to
 be non-zero if a 4-bit length word ostensibly follows).

 So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and [a,b,c] = 06 81 'a' 81 'b'
 81 'c'

 (I am still worried about the serialization/deserialization overhead
 but that's a different issue.)


Thanks.  I'll play around with the bit and numeric encodings you've
recommended.  Arrays are certainly the toughest to economize on as a text
encoding has minimum of 2 + n - 1 overhead bytes.  Text encoding for objects
has some more wiggle room with 2 + (n-1) + (n*3) extra bytes.  I was
admittedly thinking about more complicated objects.

I'm still worried about transcoding overhead as well.  If comparing to a
simple blind storage of JSON text with no validation or normalization, there
is obviously no way to beat a straight copy.  However, its not outside the
realm of reason to think that it may be possible to match or beat the clock
if comparing against text to text normalization, or perhaps adding slight
overhead to a validator.  The advantage to the binary structure is the
ability to scan it hierarchically in sibling order and provide mutation
operations with simple memcpy's as opposed to parse_to_ast - modify_ast -
serialize_ast.

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-24 Thread Terry Laurenzo


 Yeah, my concern is not whether the overhead will be zero; it's
 whether it will be small, yet allow large gains on other operations.
 Like, how much slower will it be to pull out a moderately complex 1MB
 JSON blob (not just a big string) out of a single-row, single-column
 table?  If it's 5% slower, that's probably OK, since this is a
 reasonable approximation of a worst-case scenario.  If it's 50%
 slower, that sounds painful.  It would also be worth testing with a
 much smaller size, such as a 1K object with lots of internal
 structure.  In both cases, all data cached in shared_buffers, etc.

 Then on the flip side how do we do on val[37][whatever]?  You'd like
 to hope that this will be significantly faster than the text encoding
 on both large and small objects.  If it's not, there's probably not
 much point.


We're on the same page.  I'm implementing the basic cases now and then will
come up with some benchmarks.

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-23 Thread Terry Laurenzo



 I'm still going to write up a proposed grammar that takes these items into
 account - just ran out of time tonight.


The binary format I was thinking of is here:

http://github.com/tlaurenzo/pgjson/blob/master/pgjson/shared/include/json/jsonbinary.h

This was just a quick brain dump and I haven't done a lot of diligence on
verifying it, but I think it should be more compact than most JSON text
payloads and quick to iterate over/update in sibling traversal order vs
depth-first traversal which is what we would get out of JSON text.

Thoughts?

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-20 Thread Terry Laurenzo

Good points.  In addition, any binary format needs to support object
property traversal without having to do a deep scan of all descendants.
BSON handles this with explicit lengths for document types (objects and
arrays) so that entire parts of the tree can be skipped during sibling
traversal.

It would also be nice to make sure that we store fully parsed strings.
There are lots of escape options that simply do not need to be preserved (c
escapes, unicode, octal, hex sequences) and hinder the ability to do direct
comparisons.  BSON also makes a small extra effort to ensure that object
property names are encoded in a way that is easily comparable, as this will
be the most frequently compared items.

I'm still going to write up a proposed grammar that takes these items into
account - just ran out of time tonight.

Terry

On Wed, Oct 20, 2010 at 12:46 AM, Itagaki Takahiro 
itagaki.takah...@gmail.com wrote:

 On Wed, Oct 20, 2010 at 6:39 AM, Terry Laurenzo t...@laurenzo.org wrote:
  The answer may be to have both a jsontext and jsonbinary type as each
 will
  be optimized for a different case.

 I want to choose one format for JSON rather than having two types.
 It should be more efficient than other format in many cases,
 and not so bad in other cases.

 I think the discussion was started with
  BSON could represent was a subset of what JSON could represent.
 So, any binary format could be acceptable that have enough
 representational power compared with text format.

 For example, a sequence of byte-length text could reduce
 CPU cycles for reparsing and hold all of the input as-is except
 ignorable white-spaces. It is not a BSON, but is a binary format.

 Or, if we want to store numbers in binary form, I think the
 format will be numeric type in postgres. It has high precision,
 and we don't need any higher precision than it to compare two
 numbers eventually. Even if we use BSON format, we need to extend
 it to store all of numeric values, that precision is 10^1000.

 --
 Itagaki Takahiro

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-19 Thread Terry Laurenzo

Agreed.  BSON was born out of implementations that either lacked arbitrary
precision numbers or had a strong affinity to an int/floating point way of
thinking about numbers.  I believe that if BSON had an arbitrary precision
number type, it would be a proper superset of JSON.

As an aside, the max range of an int in BSON 64bits.  Back to my original
comment that BSON was grown instead of designed, it looks like both the
32bit and 64bit integers were added late in the game and that the original
designers perhaps were just going to store all numbers as double.

Perhaps we should enumerate the attributes of what would make a good binary
encoding?

Terry

On Tue, Oct 19, 2010 at 8:57 AM, Andrew Dunstan and...@dunslane.net wrote:



 On 10/19/2010 10:44 AM, Robert Haas wrote:

 On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzot...@laurenzo.org  wrote:

- It is directly iterable without parsing and/or constructing an AST
- It is its own representation.  If iterating and you want to tear-off
 a
 value to be returned or used elsewhere, its a simple buffer copy plus
 some
 bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
 allowing them to work with the internal form directly (ala MongoDB)
- It stores a wider range of primitive types than JSON-text.  The most
 important are Date and binary.

 When last I looked at that, it appeared to me that what BSON could
 represent was a subset of what JSON could represent - in particular,
 that it had things like a 32-bit limit on integers, or something along
 those lines.  Sounds like it may be neither a superset nor a subset,
 in which case I think it's a poor choice for an internal
 representation of JSON.


 Yeah, if it can't handle arbitrary precision numbers as has previously been
 stated it's dead in the water for our purposes, I think.

 cheers

 andrew

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-19 Thread Terry Laurenzo

On Tue, Oct 19, 2010 at 2:46 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Oct 19, 2010 at 3:40 PM, Joseph Adams
 joeyadams3.14...@gmail.com wrote:
  On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas robertmh...@gmail.com
 wrote:
  I think we should take a few steps back and ask why we think that
  binary encoding is the way to go.  We store XML as text, for example,
  and I can't remember any complaints about that on -bugs or
  -performance, so why do we think JSON will be different?  Binary
  encoding is a trade-off.  A well-designed binary encoding should make
  it quicker to extract a small chunk of a large JSON object and return
  it; however, it will also make it slower to return the whole object
  (because you're adding serialization overhead).  I haven't seen any
  analysis of which of those use cases is more important and why.
 
  Speculation: the overhead involved with retrieving/sending and
  receiving/storing JSON (not to mention TOAST
  compression/decompression) will be far greater than that of
  serializing/unserializing.

 I speculate that your speculation is incorrect.  AIUI, we, unlike
 $COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY.  But
 perhaps less speculation and more benchmarking is in order.


After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction.  I'll
take some time and benchmark different cases.  My hypothesis is that a well
implemented binary structure and conversions will add minimal overhead in
the IO + Validate case which would be the typical in/out flow.  It could be
substantially faster for binary send/receive because the validation step
could be eliminated/reduced.  Further storing as binary reduces the overhead
of random access to the data by database functions.

I'm envisioning staging this up as follows:
   1. Create a jsontext.  jsontext uses text as its internal
representation.  in/out functions are essentially a straight copy or a copy
+ validate.
   2. Create a jsonbinary type.  This uses an optimized binary format for
internal rep and send/receive.  in/out is a parse/transcode operation to
standard JSON text.
   3. Internal data access functions and JSON Path require a jsonbinary.
   4. There are implicit casts to/from jsontext and jsonbinary.

I've got a grammar in mind for the binary structure that I'll share later
when I've got some more time.  It's inspired by $COMPETITOR's format but a
little more sane, using type tags that implicitly define the size of the
operands, simplifying parsing.

I'll then define the various use cases and benchmark using the different
types.  Some examples include such as IO No Validate, IO+Validate, Store and
Index, Internal Processing, Internal Composition, etc.

The answer may be to have both a jsontext and jsonbinary type as each will
be optimized for a different case.

Make sense?  It may be a week before I get through this.
Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-19 Thread Terry Laurenzo

On Tue, Oct 19, 2010 at 4:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Terry Laurenzo t...@laurenzo.org writes:
  After spending a week in the morass of this, I have to say that I am less
  certain than I was on any front regarding the text/binary distinction.
  I'll
  take some time and benchmark different cases.  My hypothesis is that a
 well
  implemented binary structure and conversions will add minimal overhead in
  the IO + Validate case which would be the typical in/out flow.  It could
 be
  substantially faster for binary send/receive because the validation step
  could be eliminated/reduced.

 I think that arguments proceeding from speed of binary send/receive
 aren't worth the electrons they're written on, because there is nothing
 anywhere that says what the binary format ought to be.  In the case of
 XML we're just using the text representation as the binary format too,
 and nobody's complained about that.  If we were to choose to stick with
 straight text internally for a JSON type, we'd do the same thing, and
 again nobody would complain.

 So, if you want to make a case for using some binary internal format or
 other, make it without that consideration.

  I'm envisioning staging this up as follows:
 1. Create a jsontext.  jsontext uses text as its internal
  representation.  in/out functions are essentially a straight copy or a
 copy
  + validate.
 2. Create a jsonbinary type.  This uses an optimized binary format
 for
  internal rep and send/receive.  in/out is a parse/transcode operation to
  standard JSON text.

 Ugh.  Please don't.  JSON should be JSON, and nothing else.  Do you see
 any other datatypes in Postgres that expose such internal considerations?

regards, tom lane


I don't think anyone here was really presenting arguments as yet.  We're
several layers deep on speculation and everyone is saying that
experimentation is needed.

I've got my own reasons for going down this path for a solution I have in
mind.  I had thought that some part of that might have been applicable to pg
core, but if not, that's no problem.  For my own edification, I'm going to
proceed down this path and see where it leads.  I'll let the list know what
I find out.

I can understand the sentiment that JSON should be JSON and nothing else
from a traditional database server's point of view, but there is nothing
sacrosanct about it in the broader context.

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-18 Thread Terry Laurenzo

  I like as simple design as we can accept. ISTM format, I/O interface,
  simple get/set, mapping tuple from/to object, and indexing are minimum
  requirement.

 +1 to small start, but simple get/set are already debatable...
 For example, text/json conversion:
  A. SELECT 'json'::json;
  B. SELECT 'text'::text::json;

 In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
 should be a json format. OTOH, B calls pgjson_json_from_text(), so the
 input can be any text. Those behaviors are surprising. I think we have
 no other choice but to define text-to-json cast as parsing. The same
 can be said for json-to-text -- type-output function vs. extracting
 text value from json.

 I think casting text to/from json should behave in the same way as type
 input/output. The xml type works in the same manner. And if so, we might
 not have any casts to/from json for consistency, even though there are
 no problems in casts for non-text types.


I just reworked some of this last night, so I'm not sure which version you
are referring to (new version has a pgplugin/jsoncast.c source file).  I was
basically circling around the same thing as you trying to find something
that felt natural and not confusing.  I agree that we don't have much of a
choice to keep in/out functions as parse/serialize and that then introducing
casts that do differently creates confusion.  When I was playing with it, I
was getting confused, and I wrote it. :)

An alternative to pg casting to extract postgres values could be to
introduce analogs to JavaScript constructors, which is the JavaScript way to
cast.  For example: String(json), Number(json), Date(json).  This would feel
natural to a JavaScript programmer and would be explicit and non-surprising:
   A. SELECT String('{a: 1, b:2}'::json - 'a')  (Returns postgres text)
   B. SELECT Number('1'::json)   (Returns postgres decimal)

I think that the most common use case for this type of thing in the DB will
be to extract a JSON scalar as a postgres scalar.

The inverse, while probably less useful, is currently represented by the
json_from_* functions.  We could collapse all of these down to one
overloaded function, say ToJson(...):
   A. SELECT ToJson(1)   (Would return a json type with an int32 1 value)
   B. SELECT ToJson('Some String')   (Would return a json type with a string
value)

There might be some syntactic magic we could do by adding an intermediate
jsonscalar type, but based on trying real cases with this stuff, you always
end up having to be explicit about your conversions anyway.  Having implicit
type coercion from this polymorphic type tends to make things confusing,
imo.



 I'll list issues before we start json types even in the simplest cases:
 
 1. where to implement json core: external library vs. inner postgres
 2. internal format: text vs. binary (*)
 3. encoding: always UTF-8 vs. database encoding (*)
 4. meaning of casts text to/from json: parse/stringify vs. get/set
 5. parser implementation: flex/bison vs. hand-coded.
 
 (*) Note that we would have comparison two json values in the future. So,
 we might need to normalize the internal format even in text representation.

 The most interesting parts of json types, including indexing and jsonpath,
 would be made on the json core. We need conclusions about those issues.


My opinions or ramblings on the above:
   1. There's a fair bit of code involved for something that many are going
to gloss over.  I can think of pros/cons for external/internal/contrib and
I'm not sure which I would choose.
   2. I'm definitely in the binary camp, but part of the reason for building
it out was to try it with some real world cases to get a feel for
performance implications end to end.  We make heavy use of MongoDB at the
office and I was thinking it might make sense to strip some of those cases
down and see how they would be implemented in this context.  I'll write up
more thoughts on how I think text/binary should perform for various cases
tonight.
   3. I think if we go with binary, we should always store UTF-8 in the
binary structure.  Otherwise, we just have too much of the guts of the
binary left to the whim of the database encoding.  As currently implemented,
all strings generated by the in/out functions should be escaped so that they
are pure ascii (not quite working, but there in theory).  JSON is by
definition UTF-8, and in this case, I think it trumps database encoding.
   4. My thoughts on the casts are above.
   5. There seems to be a lot of runtime and code size overhead inherent in
the flex/bison parsers, especially considering that they will most
frequently be invoked for very small streams.  Writing a good hand-coded
parser for comparison is just a matter of which bottle of wine to choose
prior to spending the hours coding it, and I would probably defer the
decision until later.

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-10-16 Thread Terry Laurenzo

Hi all -
I independently started some work on a similar capability as was contributed
back in August by Joey Adams for a json datatype.  Before starting, I did a
quick search but for some reason didn't turn this existing thread up.

What I've been working on is out on github for now:
http://github.com/tlaurenzo/pgjson

When I started, I was actually aiming for something else, and got caught up
going down this rabbit hole.  I took a different design approach, making the
internal form be an extended BSON stream and implementing event-driven
parsing and serializing to the different formats.  There was some discussion
in the original thread around storing plain text vs a custom format.  I have
to admit I've been back and forth a couple of times on this and have come to
like a BSON-like format for the data at rest.

Pros:
   - It is directly iterable without parsing and/or constructing an AST
   - It is its own representation.  If iterating and you want to tear-off a
value to be returned or used elsewhere, its a simple buffer copy plus some
bit twiddling.
   - It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (ala MongoDB)
   - It stores a wider range of primitive types than JSON-text.  The most
important are Date and binary.

Cons:
   - The format appears to have been grown.  Some of the critical
decisions were made late in the game (ie. why would your integral type codes
be last)
   - Natively, the format falls victim to the artificial document vs element
distinction, which I never understood.  I've worked around this with an
escape mechanism for representing root values, but its not great.
   - The processor is not resilient in the face of unknown element types

I'm leaning towards thinking that the determination comes down to the
following:
   - If you just want a checkbox item that the database has a json
datatype and some support functions, storing as text may make sense.  It can
be much simpler; however, it becomes increasingly hard to do anything real
without adding a parse to AST, manipulate, dump to text cycle to every
function.
   - If you want a json datatype that is highly integrated and manipulable,
you want a binary datastructure and in the absence of any other contender in
this area, BSON is ok (not great, but ok).
   - The addition of a JavaScript procedural language probably does not
bring its own format for data at rest.  All of the engines I know of (I
haven't looked at what Guile is doing) do not have a static representation
for internal data structures.  They are heap objects with liberal use of
internal and external pointers.  Most do have a mechanism, however, for
injecting foreign objects into the runtime without resorting to making a
dumb copy.  As such, the integration approach would probably be to determine
the best format for JSON data at rest and provide adapters to the chosen
JavaScript runtime to manipulate this at-rest format directly (potentially
using a copy on write approach).  If the at-rest format is Text, then you
would need to do a parse-to-AST step for each JavaScript function
invocation.

Here's a few notes on my current implementation:
   - Excessive use of lex/yacc: This was quick and easy but the grammars are
simple enough that I'd probably hand-code a parser for any final solution.
   - When the choice between following the json.org spec to the letter and
implementing lenient parsing for valid JavaScript constructs arose, I chose
lenient.
   - Too much buffer copying: When I started, I was just doodling with
writing C code to manipulate JSON/BSON and not working with postgres in
particular.  As such, it all uses straight malloc/free and too many copies
are made to get things in and out of VARDATA structures.  This would all be
eliminated in any real version.
   - UTF-8 is supported but not fully working completely.  The support
functions that Joey wrote do a better job at this.
   - My json path evaluation is crippled.  Given the integration with the PG
type system, I thought I just wanted a simple property traversal mechanism,
punting higher level manipulation to native PG functions.  Seeing real
JSONPath work, though, I'm not so sure.  I like the simplicity of what I've
done but the features of the full bit are nice too.
   - This is first-pass prototype code with the goal of seeing it all
working together.

While I had an end in mind, I did a lot of this for the fun of it and to
just scratch an itch, so I'm not really advocating for anything at this
point.  I'm curious as to what others think the state of JSON and Postgres
should be.  I've worked with JavaScript engines a good deal and would be
happy to help get us there, either using some of the work/approaches here or
going in a different direction.

Terry

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

11 matches

Site Navigation

Mail list logo

Footer information