Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Ilya Khlopotov Wed, 30 Jan 2019 09:49:23 -0800

> I was kind of expecting to see some magic value indicating that the 
> subsequent set of keys with the same prefix are all elements of a “multi-part 
> object”
I missed this aspect. This is easy to solve (as you've mentioned) by using 
either a special character or reserved value in the mapping table.
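
To make that concrete, here is a rough sketch in Python (for illustration only, not a proposal for the actual encoding). The `MULTIPART` sentinel, the helper names `write_scalar` / `read_scalar`, and the tuple-shaped keys are all assumptions of this sketch; a plain dict stands in for the FoundationDB keyspace.

```python
# Illustrative only: a dict stands in for the keyspace, tuples stand in for keys.
VALUE_LIMIT = 100_000  # 100 kB value size limit discussed in this thread

# Hypothetical reserved value meaning "the real value is stored in parts
# under keys that extend this one by {PART_IDX}". A real encoding would have
# to guarantee that no ordinary scalar can encode to this same byte string.
MULTIPART = b"\x00MULTIPART"


def write_scalar(kv, key, value, limit=VALUE_LIMIT):
    """Store bytes `value` under tuple `key`, splitting it if it is too large."""
    if len(value) <= limit:
        kv[key] = value
        return
    kv[key] = MULTIPART
    for part_idx, start in enumerate(range(0, len(value), limit)):
        kv[key + (part_idx,)] = value[start:start + limit]


def read_scalar(kv, key):
    """Return the value stored under `key`, joining its parts if it was split."""
    value = kv[key]
    if value != MULTIPART:
        return value
    # Collect the direct children of `key`; they sort by PART_IDX.
    parts = sorted(
        (k, v) for k, v in kv.items()
        if len(k) == len(key) + 1 and k[:len(key)] == key
    )
    return b"".join(v for _, v in parts)
```

This costs one extra KV per oversized scalar, which is the overhead Adam mentions below; whether that can be avoided is still open.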


On 2019/01/30 16:58:29, Adam Kocoloski <[email protected]> wrote: 
> Jan, I don’t think it does have that "fun property #2", as the mapping is 
> created separately for each document. In this proposal the field name “foo” 
> could map to 2 in one document and 42 in another.
> 
> Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on field 
> paths is anything more than a theoretical concern. It’s hard for me to 
> imagine a useful schema that would get anywhere near that deep, but maybe I’m 
> insufficiently creative :) There’s certainly a storage overhead from 
> repeating the upper portion of a path over and over again, but that’s also 
> something the storage engine can optimize away through prefix elision. The 
> current production storage engine in FoundationDB does not do this elision, 
> but the new one in development does.
> 
> The value size limit is probably not so theoretical. I think as a project we 
> could choose to impose a 100KB size limit on scalar values - a user who had a 
> string longer than 100KB could chunk it up into an array of strings pretty 
> easily to work around that limit. But let’s say we don’t want to impose that 
> limit. In your design, how do I distinguish {PART_IDX} from the elements of 
> the {JSON_PATH}? I was kind of expecting to see some magic value indicating 
> that the subsequent set of keys with the same prefix are all elements of a 
> “multi-part object”:
> 
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …"
> ...
> 
> You might have figured out something more efficient that saves a KV here but 
> I can’t quite grok it.
> 
> Cheers, Adam
> 
> 
> > On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <[email protected]> wrote:
> > 
> > 
> > 
> >> On 30. Jan 2019, at 14:22, Jan Lehnardt <[email protected]> wrote:
> >> 
> >> Thanks Ilya for getting this started!
> >> 
> >> Two quick notes on this one:
> >> 
> >> 1. note that JSON does not guarantee object key order and that CouchDB has 
> >> never guaranteed it either, and with say emit(doc.foo, doc.bar), if either 
> >> emit() parameter was an object, the undefined-sort-order of SpiderMonkey 
> >> would mix things up. While worth bringing up, this is not a BC break.
> >> 
> >> 2. This would have the fun property of being able to rename a key inside 
> >> all docs that have that key.
> > 
> > …in one short operation.
> > 
> > Best
> > Jan
> > —
> >> 
> >> Best
> >> Jan
> >> —
> >> 
> >>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <[email protected]> wrote:
> >>> 
> >>> # First proposal
> >>> 
> >>> In order to overcome FoundationDB limitations on key size (10 kB) and 
> >>> value size (100 kB) we could use the following approach.
> >>> 
> >>> Below, the paths use slashes for illustration purposes only. We can 
> >>> use nested subspaces, tuples, directories or something else. 
> >>> 
> >>> - Store documents in a subspace or directory (to keep the key prefix 
> >>> short)
> >>> - When we store the document we would enumerate all field names (0 and 1 
> >>> are reserved) and store the mapping table in a key which looks like:
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / 0
> >>> ```
> >>> - Flatten the JSON document (convert it into key value pairs where the 
> >>> key is `JSON_PATH` and value is `SCALAR_VALUE`)
> >>> - Replace elements of JSON_PATH with integers from mapping table we 
> >>> constructed earlier
> >>> - When we have an array, use `1 / {array_idx}`
> >>> - Store scalar values in the keys which look like the following (we use 
> >>> `JSON_PATH` with integers). 
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
> >>> ```
> >>> - If the scalar value exceeds 100kB we would split it and store every 
> >>> part under key constructed as:
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
> >>> ```
> >>> 
> >>> Since all parts of the document are stored under a common 
> >>> `{DB_DOCS_NS} / {DOC_KEY}` prefix, they will be stored on the same server 
> >>> most of the time. The document can be retrieved with a single range query: 
> >>> `txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY} / 0xFF")`. 
> >>> We can reconstruct the document since the mapping is returned as well.
> >>> 
> >>> The downside of this approach is that we wouldn't be able to preserve 
> >>> the order of keys in the JSON object. Currently the `jiffy` JSON encoder 
> >>> respects the order of keys.
> >>> ```
> >>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
> >>> <<"{\"bbb\":1,\"aaa\":12}">>
> >>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
> >>> <<"{\"aaa\":12,\"bbb\":1}">>
> >>> ```
> >>> 
> >>> Best regards,
> >>> iilyak
> >>> 
> >>> On 2019/01/30 13:02:57, Ilya Khlopotov <[email protected]> wrote: 
> >>>> As you might already know, FoundationDB has a number of limitations 
> >>>> which influence the way we might store JSON documents. The limitations 
> >>>> are:
> >>>> 
> >>>> | limitation            | recommended value | recommended max | absolute max |
> >>>> |-----------------------|------------------:|----------------:|-------------:|
> >>>> | transaction duration  |                   |                 |        5 sec |
> >>>> | transaction data size |                   |                 |        10 MB |
> >>>> | key size              |          32 bytes |            1 kB |        10 kB |
> >>>> | value size            |                   |           10 kB |       100 kB |
> >>>> 
> >>>> In order to fit the JSON document into 100 kB we would have to partition 
> >>>> it in some way. There are three ways of partitioning the document:
> >>>> 1. store multiple binary blobs (parts) in different keys
> >>>> 2. flatten the JSON structure and store every path leading to a scalar 
> >>>> value under its own key
> >>>> 3. measure the size of different branches of a tree representing the 
> >>>> JSON document (while we parse) and use another key for the branch when 
> >>>> we are about to exceed the limit
> >>>> 
> >>>> - The first approach is the simplest but it wouldn't allow us to access 
> >>>> parts of the document.
> >>>> - The downsides of the second approach are:
> >>>>   - the flattened JSON structure would have long paths, which means 
> >>>>     longer keys
> >>>>   - a scalar value cannot be more than 100 kB (unless we split it as 
> >>>>     well)
> >>>> - The third approach falls short in cases when the structure of the 
> >>>> document doesn't allow a clean cut of branches:
> >>>>   - complex rules to handle all corner cases
> >>>> 
> >>>> The goals of this thread are:
> >>>> - to collect ideas on how to encode and store the JSON document
> >>>> - to comment on the collected ideas
> >>>> 
> >>>> Non-goals:
> >>>> - the storage of metadata for the document, which would be discussed 
> >>>> elsewhere:
> >>>>   - tombstones
> >>>>   - edit conflicts
> >>>>   - revisions
> >>>> 
> >>>> Best regards,
> >>>> iilyak
> >>>> 
> >> 
> >> -- 
> >> Professional Support for Apache CouchDB:
> >> https://neighbourhood.ie/couchdb-support/
> >> 
> > 
> > -- 
> > Professional Support for Apache CouchDB:
> > https://neighbourhood.ie/couchdb-support/
>
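
For reference, the first proposal quoted above (per-document mapping table under key 0, arrays under the reserved id 1) could be sketched roughly as follows. This is Python for illustration only: the function names, the tuple-shaped keys, the dict standing in for the keyspace, and the use of JSON text for stored values are all assumptions of the sketch. It also assumes the document root is an object and silently drops empty objects and arrays, which a real encoding would need to handle.

```python
import json

MAPPING_ID = 0  # reserved: key that holds the field-name mapping table
ARRAY_ID = 1    # reserved: the next path element is an array index


def flatten(doc):
    """Return (mapping, rows); rows are (integer_path_tuple, scalar) pairs."""
    mapping, rows = {}, []

    def field_id(name):
        if name not in mapping:
            mapping[name] = len(mapping) + 2  # ids 0 and 1 are reserved
        return mapping[name]

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, path + (field_id(name),))
        elif isinstance(node, list):
            for idx, child in enumerate(node):
                walk(child, path + (ARRAY_ID, idx))
        else:
            rows.append((path, node))

    walk(doc, ())
    return mapping, rows


def to_kv(ns, doc_key, doc):
    """Encode `doc` as {key_tuple: json_text}, mapping table included."""
    mapping, rows = flatten(doc)
    kv = {(ns, doc_key, MAPPING_ID): json.dumps(mapping)}
    for path, scalar in rows:
        kv[(ns, doc_key) + path] = json.dumps(scalar)
    return kv


def from_kv(ns, doc_key, kv):
    """Rebuild the document from the rows stored under (ns, doc_key)."""
    prefix = (ns, doc_key)
    table = json.loads(kv[prefix + (MAPPING_ID,)])
    names = {fid: name for name, fid in table.items()}
    doc = {}
    # Sorted order guarantees array elements arrive in index order.
    for key in sorted(k for k in kv if k[:2] == prefix and k[2:] != (MAPPING_ID,)):
        path, steps, i = key[2:], [], 0
        while i < len(path):  # translate ids back to names / array indexes
            if path[i] == ARRAY_ID:
                steps.append(path[i + 1])
                i += 2
            else:
                steps.append(names[path[i]])
                i += 1
        node = doc
        for step, nxt in zip(steps, steps[1:] + [None]):
            if nxt is None:
                child = json.loads(kv[key])
            else:
                child = [] if isinstance(nxt, int) else {}
            if isinstance(step, int):
                if step == len(node):
                    node.append(child)
                node = node[step]
            else:
                node = node.setdefault(step, child)
    return doc
```

Note that the rebuilt object lists its fields in mapping-id order rather than in the original order, which is the key-ordering downside mentioned in the proposal.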
