Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Ilya Khlopotov Wed, 30 Jan 2019 09:54:40 -0800

The FoundationDB Record Layer uses a global schema for JSON documents. It also 
has a nice way of creating indexes and supports schema evolution. However, this 
support comes at the cost of extra lookups in a different subspace. With a local 
mapping table we are almost certain (except in a corner case) that the schema and 
the JSON fields are colocated on a single node, because they share a common prefix.


Best regards,
iilyak
On 2019/01/30 17:05:01, Jan Lehnardt <[email protected]> wrote: 
> Ah sure, if we store the *cough* schema per doc, then it's not that easy. An 
> iteration of this proposal could store paths globally with ids that the k/v 
> store then uses for keys, which would enable what I described, but happy to 
> ignore this for the time being. :)
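
A minimal sketch of the global variant described above, assuming one database-wide table that maps field names to integer ids. `GlobalFieldTable`, `intern`, and `rename` are hypothetical names, and plain Python dicts stand in for the key-value store; nothing here is an existing CouchDB or FoundationDB API.

```python
class GlobalFieldTable:
    """Hypothetical database-wide table: field name <-> integer id."""

    def __init__(self):
        self.id_by_name = {}
        self.name_by_id = {}
        self.next_id = 2  # assumption: 0 and 1 stay reserved, as in the proposal

    def intern(self, name):
        """Return the id for a field name, allocating one on first use."""
        if name not in self.id_by_name:
            self.id_by_name[name] = self.next_id
            self.name_by_id[self.next_id] = name
            self.next_id += 1
        return self.id_by_name[name]

    def rename(self, old, new):
        """Rename a field everywhere by rewriting a single table entry."""
        if new in self.id_by_name:
            raise ValueError("target field name already exists")
        field_id = self.id_by_name.pop(old)
        self.id_by_name[new] = field_id
        self.name_by_id[field_id] = new


table = GlobalFieldTable()
# Documents store only ids, never the field names themselves.
stored = {
    "doc1": {table.intern("colour"): "red"},
    "doc2": {table.intern("colour"): "blue", table.intern("size"): 3},
}

table.rename("colour", "color")  # no document is touched

decoded = {
    doc_id: {table.name_by_id[fid]: value for fid, value in fields.items()}
    for doc_id, fields in stored.items()
}
print(decoded)
```

The price of this design is that every read needs the global table as well as the document rows.
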
> 
> Cheers
> Jan
> —
> 
> > On 30. Jan 2019, at 17:58, Adam Kocoloski <[email protected]> wrote:
> > 
> > Jan, I don’t think it does have that "fun property #2", as the mapping is 
> > created separately for each document. In this proposal the field name “foo” 
> > could map to 2 in one document and 42 in another.
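
To make the point concrete, here is a small sketch of a per-document mapping, assuming ids are handed out in order of first appearance starting at 2. `build_mapping` is a hypothetical helper, not part of the proposal.

```python
def build_mapping(doc, first_id=2):
    """Assign an integer to every field name in ONE document.

    Assumption: ids are allocated in order of first appearance,
    starting at 2 because 0 and 1 are reserved in the proposal.
    """
    mapping = {}

    def walk(node):
        if isinstance(node, dict):
            for name, child in node.items():
                if name not in mapping:
                    mapping[name] = first_id + len(mapping)
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(doc)
    return mapping


doc_a = {"foo": 1, "bar": 2}
doc_b = {"bar": {"baz": True}, "foo": 1}

print(build_mapping(doc_a))  # {'foo': 2, 'bar': 3}
print(build_mapping(doc_b))  # {'bar': 2, 'baz': 3, 'foo': 4}
```

The same field name gets a different id in each document, so no single write can rename it across all documents.
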
> > 
> > Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on 
> > field paths is anything more than a theoretical concern. It’s hard for me 
> > to imagine a useful schema that would get anywhere near that deep, but 
> > maybe I’m insufficiently creative :) There’s certainly a storage overhead 
> > from repeating the upper portion of a path over and over again, but that’s 
> > also something the storage engine can optimize away through prefix elision. 
> > The current production storage engine in FoundationDB does not do this 
> > elision, but the new one in development does.
> > 
> > The value size limit is probably not so theoretical. I think as a project 
> > we could choose to impose a 100KB size limit on scalar values - a user who 
> > had a string longer than 100KB could chunk it up into an array of strings 
> > pretty easily to work around that limit. But let’s say we don’t want to 
> > impose that limit. In your design, how do I distinguish {PART_IDX} from the 
> > elements of the {JSON_PATH}? I was kind of expecting to see some magic 
> > value indicating that the subsequent set of keys with the same prefix are 
> > all elements of a “multi-part object”:
> > 
> > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
> > {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …"
> > ...
> > 
> > You might have figured out something more efficient that saves a KV here 
> > but I can’t quite grok it.
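
A rough sketch of the marker-based layout described above, assuming keys are tuples and values are bytes. `K_MULTIPART`, `encode_value`, and `decode_value` are hypothetical names, and the sentinel's byte value is invented for illustration.

```python
K_MULTIPART = b"\x00multipart"  # hypothetical sentinel; a real encoding would
                                # need a type tag so that no user value can
                                # collide with it
VALUE_LIMIT = 100_000           # value size limit in bytes


def encode_value(path, value, limit=VALUE_LIMIT):
    """Return the (key, value) pairs needed to store one scalar (bytes)."""
    if len(value) <= limit:
        return [(path, value)]
    pairs = [(path, K_MULTIPART)]  # marker: the parts follow under this prefix
    for part_idx, start in enumerate(range(0, len(value), limit)):
        pairs.append((path + (part_idx,), value[start:start + limit]))
    return pairs


def decode_value(path, kvs):
    """Rebuild one scalar from a dict of key tuple -> bytes."""
    head = kvs[path]
    if head != K_MULTIPART:
        return head
    parts = sorted(
        (key, chunk) for key, chunk in kvs.items()
        if len(key) == len(path) + 1 and key[:len(path)] == path
    )
    return b"".join(chunk for _, chunk in parts)


path = ("docs", "doc1", 2)
big = b"x" * 250_000
pairs = encode_value(path, big)
print(len(pairs))  # 4: one marker plus three parts
restored = decode_value(path, dict(pairs))
```

The marker costs one extra key-value pair per oversized scalar, but a reader never has to guess whether the last path element is a part index.
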
> > 
> > Cheers, Adam
> > 
> > 
> >> On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <[email protected]> wrote:
> >> 
> >> 
> >> 
> >>> On 30. Jan 2019, at 14:22, Jan Lehnardt <[email protected]> wrote:
> >>> 
> >>> Thanks Ilya for getting this started!
> >>> 
> >>> Two quick notes on this one:
> >>> 
> >>> 1. Note that JSON does not guarantee object key order, and CouchDB 
> >>> has never guaranteed it either. With, say, emit(doc.foo, doc.bar), if 
> >>> either emit() parameter was an object, the undefined sort order of 
> >>> SpiderMonkey would mix things up. While worth bringing up, this is not a 
> >>> BC break.
> >>> 
> >>> 2. This would have the fun property of being able to rename a key inside 
> >>> all docs that have that key.
> >> 
> >> …in one short operation.
> >> 
> >> Best
> >> Jan
> >> —
> >>> 
> >>> Best
> >>> Jan
> >>> —
> >>> 
> >>>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <[email protected]> wrote:
> >>>> 
> >>>> # First proposal
> >>>> 
> >>>> In order to overcome the FoundationDB limits on key size (10 kB) and 
> >>>> value size (100 kB) we could use the following approach.
> >>>> 
> >>>> Below, the paths use slashes for illustration purposes only. We can 
> >>>> use nested subspaces, tuples, directories or something else. 
> >>>> 
> >>>> - Store documents in a subspace or directory (to keep the key prefix 
> >>>> short)
> >>>> - When we store the document, we would enumerate all field names (0 and 1 
> >>>> are reserved) and store the mapping table under a key which looks like:
> >>>> ```
> >>>> {DB_DOCS_NS} / {DOC_KEY} / 0
> >>>> ```
> >>>> - Flatten the JSON document (convert it into key-value pairs where the 
> >>>> key is `JSON_PATH` and the value is `SCALAR_VALUE`)
> >>>> - Replace the elements of `JSON_PATH` with integers from the mapping table we 
> >>>> constructed earlier
> >>>> - For an array, use `1 / {array_idx}`
> >>>> - Store scalar values under keys which look like the following (we use 
> >>>> `JSON_PATH` with integers):
> >>>> ```
> >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
> >>>> ```
> >>>> - If the scalar value exceeds 100 kB, we would split it and store every 
> >>>> part under a key constructed as:
> >>>> ```
> >>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
> >>>> ```
> >>>> 
> >>>> Since all parts of a document are stored under the common `{DB_DOCS_NS} 
> >>>> / {DOC_KEY}` prefix, they will be stored on the same server most of the time. 
> >>>> The document can be retrieved with a range query 
> >>>> (`txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / 
> >>>> {DOC_KEY} / 0xFF")`). We can reconstruct the document since the mapping 
> >>>> is returned as well.
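
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: keys are Python tuples, a dict plus `sorted()` stands in for the ordered key-value store and its range read, the top-level document is an object, and splitting of large values is left out. `flatten`, `store`, `read_range`, `insert`, and `load` are hypothetical names.

```python
MAPPING = 0  # reserved key suffix: holds the field-name mapping table
ARRAY = 1    # reserved path element: the next element is an array index


def flatten(doc):
    """Return (mapping, rows): field name -> id, and (integer path, scalar)."""
    mapping, rows = {}, []

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                field_id = mapping.setdefault(name, len(mapping) + 2)
                walk(child, path + (field_id,))
        elif isinstance(node, list):
            for idx, child in enumerate(node):
                walk(child, path + (ARRAY, idx))
        else:
            rows.append((path, node))

    walk(doc, ())
    return mapping, rows


def store(kv, ns, doc_key, doc):
    mapping, rows = flatten(doc)
    kv[(ns, doc_key, MAPPING)] = mapping  # a real store would serialize this
    for path, scalar in rows:
        kv[(ns, doc_key) + path] = scalar


def read_range(kv, ns, doc_key):
    """Stand-in for a range read over every key under ns / doc_key."""
    return sorted((k, v) for k, v in kv.items() if k[:2] == (ns, doc_key))


def insert(root, steps, scalar):
    """Place a scalar at the given path of names (str) and indexes (int)."""
    node = root
    for step, nxt in zip(steps, steps[1:] + [None]):
        if nxt is None:
            child = scalar
        else:
            child = [] if isinstance(nxt, int) else {}
        if isinstance(step, int):
            if step == len(node):  # rows arrive sorted, so indexes ascend
                node.append(child)
            node = node[step]
        else:
            node = node.setdefault(step, child)


def load(kv, ns, doc_key):
    rows = read_range(kv, ns, doc_key)
    names = {fid: name for name, fid in rows[0][1].items()}  # mapping sorts first
    doc = {}
    for key, scalar in rows[1:]:
        path, steps, i = key[2:], [], 0
        while i < len(path):
            if path[i] == ARRAY:
                steps.append(path[i + 1])
                i += 2
            else:
                steps.append(names[path[i]])
                i += 1
        insert(doc, steps, scalar)
    return doc


kv = {}
original = {"name": "alice", "tags": ["a", "b"],
            "address": {"city": "Oslo", "geo": [1.5, 2.5]}}
store(kv, "docs", "doc1", original)
restored = load(kv, "docs", "doc1")
print(restored == original)  # True
```

Two limitations of this sketch are worth noting: empty objects and empty arrays produce no rows and so disappear on reload, and the key order of the rebuilt object follows the integer ids rather than the order in the source document.
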
> >>>> 
> >>>> The downside of this approach is that we wouldn't be able to ensure the same 
> >>>> order of keys in the JSON object. Currently the `jiffy` JSON encoder 
> >>>> respects the order of keys.
> >>>> ```
> >>>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
> >>>> <<"{\"bbb\":1,\"aaa\":12}">>
> >>>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
> >>>> <<"{\"aaa\":12,\"bbb\":1}">>
> >>>> ```
> >>>> 
> >>>> Best regards,
> >>>> iilyak
> >>>> 
> >>>>> On 2019/01/30 13:02:57, Ilya Khlopotov <[email protected]> wrote: 
> >>>>> As you might already know, FoundationDB has a number of limitations 
> >>>>> which influence the way we might store JSON documents. The limitations 
> >>>>> are:
> >>>>> 
> >>>>> | limitation            | recommended value | recommended max | absolute max |
> >>>>> |-----------------------|------------------:|----------------:|-------------:|
> >>>>> | transaction duration  |                   |                 |        5 sec |
> >>>>> | transaction data size |                   |                 |        10 MB |
> >>>>> | key size              |          32 bytes |            1 kB |        10 kB |
> >>>>> | value size            |                   |           10 kB |       100 kB |
> >>>>> 
> >>>>> In order to fit the JSON document into 100 kB we would have to partition 
> >>>>> it in some way. There are three ways of partitioning the document:
> >>>>> 1. store multiple binary blobs (parts) in different keys
> >>>>> 2. flatten the JSON structure and store every path leading to a scalar 
> >>>>> value under its own key
> >>>>> 3. measure the size of different branches of a tree representing the 
> >>>>> JSON document (while we parse) and use another key for the branch when 
> >>>>> we are about to exceed the limit
> >>>>> 
> >>>>> - The first approach is the simplest, but it wouldn't allow us to access 
> >>>>> individual parts of the document.
> >>>>> - The downsides of the second approach are:
> >>>>>   - the flattened JSON structure would have long paths, which means longer keys
> >>>>>   - a scalar value cannot be more than 100 kB (unless we split it as 
> >>>>>     well)
> >>>>> - The third approach falls short when the structure of the 
> >>>>> document doesn't allow branches to be cut off cleanly:
> >>>>>   - complex rules would be needed to handle all corner cases
> >>>>> 
> >>>>> The goals of this thread are:
> >>>>> - to collect ideas on how to encode and store the JSON document
> >>>>> - to comment on the collected ideas
> >>>>> 
> >>>>> Non-goals:
> >>>>> - the storage of metadata for the document, which will be discussed elsewhere:
> >>>>>   - tombstones
> >>>>>   - edit conflicts
> >>>>>   - revisions
> >>>>> 
> >>>>> Best regards,
> >>>>> iilyak
> >>>>> 
> >>> 
> >>> -- 
> >>> Professional Support for Apache CouchDB:
> >>> https://neighbourhood.ie/couchdb-support/
> >>> 
> >> 
> >> -- 
> >> Professional Support for Apache CouchDB:
> >> https://neighbourhood.ie/couchdb-support/
> 
>
