Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

Jan Lehnardt Wed, 30 Jan 2019 09:06:21 -0800

Ah sure, if we store the *cough* schema per doc, then it's not that easy. An 
iteration of this proposal could store paths globally with ids that the k/v 
store then uses for keys, which would enable what I described, but happy to 
ignore this for the time being. :)
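
A minimal sketch of that iteration, assuming a single global table that maps field names to integer ids. The `GlobalFieldTable` class, the dict-based `kv` store and the restriction to flat documents are illustrative assumptions, not part of the proposal. Because document keys only carry ids, renaming a field is one update to the table:

```python
class GlobalFieldTable:
    """Hypothetical global mapping between field names and integer ids."""

    def __init__(self):
        self.id_by_name = {}
        self.name_by_id = {}

    def intern(self, name):
        # Reuse the existing id, or hand out the next free one.
        if name not in self.id_by_name:
            field_id = len(self.name_by_id)
            self.id_by_name[name] = field_id
            self.name_by_id[field_id] = name
        return self.id_by_name[name]

    def rename(self, old, new):
        # One small write to the table: no document keys are touched.
        if new in self.id_by_name:
            raise ValueError("target name already in use")
        field_id = self.id_by_name.pop(old)
        self.id_by_name[new] = field_id
        self.name_by_id[field_id] = new


def store(kv, table, doc_key, doc):
    # Only flat documents here, to keep the sketch short.
    for name, value in doc.items():
        kv[(doc_key, table.intern(name))] = value


def load(kv, table, doc_key):
    return {table.name_by_id[field_id]: value
            for (key, field_id), value in kv.items() if key == doc_key}


table, kv = GlobalFieldTable(), {}
store(kv, table, "doc1", {"foo": 1, "bar": 2})
store(kv, table, "doc2", {"foo": 3})
table.rename("foo", "baz")
print(load(kv, table, "doc1"))
print(load(kv, table, "doc2"))
```

With a per-document mapping, the same rename would have to rewrite the mapping key of every document that contains the field.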


Cheers
Jan
—

> On 30. Jan 2019, at 17:58, Adam Kocoloski <kocol...@apache.org> wrote:
> 
> Jan, I don’t think it does have that "fun property #2", as the mapping is 
> created separately for each document. In this proposal the field name “foo” 
> could map to 2 in one document and 42 in another.
> 
> Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on field 
> paths is anything more than a theoretical concern. It’s hard for me to 
> imagine a useful schema that would get anywhere near that deep, but maybe I’m 
> insufficiently creative :) There’s certainly a storage overhead from 
> repeating the upper portion of a path over and over again, but that’s also 
> something the storage engine can optimize away through prefix elision. The 
> current production storage engine in FoundationDB does not do this elision, 
> but the new one in development does.
> 
> The value size limit is probably not so theoretical. I think as a project we 
> could choose to impose a 100KB size limit on scalar values - a user who had a 
> string longer than 100KB could chunk it up into an array of strings pretty 
> easily to work around that limit. But let’s say we don’t want to impose that 
> limit. In your design, how do I distinguish {PART_IDX} from the elements of 
> the {JSON_PATH}? I was kind of expecting to see some magic value indicating 
> that the subsequent set of keys with the same prefix are all elements of a 
> “multi-part object”:
> 
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}  = kMULTIPART
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}  = “First 100 KB …"
> ...
> 
> You might have figured out something more efficient that saves a KV here but 
> I can’t quite grok it.
> 
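A rough sketch of the marker layout described above, using a plain dict in place of the k/v store. The `K_MULTIPART` sentinel, the tuple-shaped keys and the `put_scalar` / `get_scalar` helpers are hypothetical names for illustration; a real store would need a reserved byte encoding for the marker.

```python
VALUE_LIMIT = 100_000          # value size limit discussed in the thread, in bytes
K_MULTIPART = object()         # in-memory stand-in for the "magic value"


def put_scalar(kv, prefix, data, limit=VALUE_LIMIT):
    """Store bytes under prefix, splitting into parts if they exceed limit."""
    if len(data) <= limit:
        kv[prefix] = data
        return
    kv[prefix] = K_MULTIPART   # tells the reader that the child keys are parts
    for idx, start in enumerate(range(0, len(data), limit)):
        kv[prefix + (idx,)] = data[start:start + limit]


def get_scalar(kv, prefix):
    """Read a scalar back, joining its parts when the marker is present."""
    value = kv[prefix]
    if value is not K_MULTIPART:
        return value
    part_keys = sorted(
        key for key in kv
        if len(key) == len(prefix) + 1 and key[:len(prefix)] == prefix
    )
    return b"".join(kv[key] for key in part_keys)


kv = {}
put_scalar(kv, ("docs", "doc1", 2), b"abcdefghij", limit=4)  # split into 3 parts
put_scalar(kv, ("docs", "doc1", 3), b"tiny", limit=4)        # stored directly
print(get_scalar(kv, ("docs", "doc1", 2)))
print(get_scalar(kv, ("docs", "doc1", 3)))
```

The marker costs one extra key per oversized scalar, but it lets a reader tell part indexes apart from path elements without guessing.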
> Cheers, Adam
> 
> 
>> On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <j...@apache.org> wrote:
>> 
>> 
>> 
>>> On 30. Jan 2019, at 14:22, Jan Lehnardt <j...@apache.org> wrote:
>>> 
>>> Thanks Ilya for getting this started!
>>> 
>>> Two quick notes on this one:
>>> 
>>> 1. note that JSON does not guarantee object key order and that CouchDB has 
>>> never guaranteed it either, and with say emit(doc.foo, doc.bar), if either 
>>> emit() parameter was an object, the undefined-sort-order of SpiderMonkey 
>>> would mix things up. While worth bringing up, this is not a BC break.
>>> 
>>> 2. This would have the fun property of being able to rename a key inside 
>>> all docs that have that key.
>> 
>> …in one short operation.
>> 
>> Best
>> Jan
>> —
>>> 
>>> Best
>>> Jan
>>> —
>>> 
>>>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <iil...@apache.org> wrote:
>>>> 
>>>> # First proposal
>>>> 
>>>> In order to overcome FoundationDB limitations on key size (10 kB) and value 
>>>> size (100 kB) we could use the following approach.
>>>> 
>>>> The paths below use slashes for illustration purposes only. We can 
>>>> use nested subspaces, tuples, directories or something else. 
>>>> 
>>>> - Store documents in a subspace or directory  (to keep prefix for a key 
>>>> short)
>>>> - When we store the document we would enumerate all field names (0 and 1 
>>>> are reserved) and store the mapping table in a key which looks like:
>>>> ```
>>>> {DB_DOCS_NS} / {DOC_KEY} / 0
>>>> ```
>>>> - Flatten the JSON document (convert it into key value pairs where the key 
>>>> is `JSON_PATH` and value is `SCALAR_VALUE`)
>>>> - Replace elements of JSON_PATH with integers from mapping table we 
>>>> constructed earlier
>>>> - When we have an array, use `1 / {array_idx}`
>>>> - Store scalar values in the keys which look like the following (we use 
>>>> `JSON_PATH` with integers). 
>>>> ```
>>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
>>>> ```
>>>> - If the scalar value exceeds 100 kB we would split it and store every part 
>>>> under a key constructed as:
>>>> ```
>>>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
>>>> ```
>>>> 
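The steps above could be sketched as follows. This is an illustration under assumptions, not the proposed implementation: keys are Python tuples rather than FoundationDB subspaces, the mapping table is stored as a plain dict, oversized scalars are not split, and empty objects or arrays are silently dropped. The names `flatten`, `MAPPING_KEY` and `ARRAY_MARKER` are made up for this sketch.

```python
MAPPING_KEY = 0    # reserved: the key that holds the per-document mapping table
ARRAY_MARKER = 1   # reserved: the next path element is an array index


def flatten(doc_key, doc, ns="docs"):
    """Turn a JSON-like dict into {key tuple: scalar} rows plus a mapping row."""
    mapping = {}   # field name -> integer id, starting at 2

    def field_id(name):
        return mapping.setdefault(name, len(mapping) + 2)

    rows = {}

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, path + (field_id(name),))
        elif isinstance(node, list):
            for idx, child in enumerate(node):
                walk(child, path + (ARRAY_MARKER, idx))
        else:
            rows[(ns, doc_key) + path] = node   # scalar leaf

    walk(doc, ())
    # Store the inverse table (id -> name) so a reader can rebuild the document.
    rows[(ns, doc_key, MAPPING_KEY)] = {fid: name for name, fid in mapping.items()}
    return rows


rows = flatten("d1", {"name": "x", "tags": ["a", "b"], "owner": {"name": "y"}})
for key in sorted(rows):
    print(key, "=", rows[key])
```

Note that the nested `name` under `owner` reuses id 2, so a repeated field name costs one mapping entry per document no matter how often it appears.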
>>>> Since all parts of the document are stored under a common `{DB_DOCS_NS} / 
>>>> {DOC_KEY}` prefix, they will be stored on the same server most of the time. 
>>>> The document can be retrieved with a range query:
>>>> `txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY} / 0xFF")`.
>>>> We can reconstruct the document since the mapping is returned as well.
>>>> 
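A sketch of the read path, assuming tuple-shaped keys and a sorted list standing in for the ordered key space. `get_range` here is a local helper that mimics a half-open range read; it is not the FoundationDB client call. The rows are hard-coded to match a small sample document, and `to_steps`, `child_of` and `rebuild` are hypothetical helper names.

```python
import bisect

ARRAY_MARKER = 1  # reserved path element: the next element is an array index

# Sorted (key, value) pairs standing in for the ordered key space.
STORE = sorted({
    ("docs", "d1", 0): {2: "name", 3: "tags", 4: "owner"},  # mapping table
    ("docs", "d1", 2): "x",
    ("docs", "d1", 3, 1, 0): "a",
    ("docs", "d1", 3, 1, 1): "b",
    ("docs", "d1", 4, 2): "y",
    ("docs", "d2", 0): {2: "other"},
    ("docs", "d2", 2): "ignored",
}.items(), key=lambda pair: pair[0])


def get_range(store, begin, end):
    """Return all pairs with begin <= key < end."""
    keys = [key for key, _ in store]
    return store[bisect.bisect_left(keys, begin):bisect.bisect_left(keys, end)]


def to_steps(path, mapping):
    """Translate integer path elements into field names (str) and indexes (int)."""
    steps, elements = [], iter(path)
    for element in elements:
        if element == ARRAY_MARKER:
            steps.append(next(elements))      # array index
        else:
            steps.append(mapping[element])    # field name
    return steps


def child_of(node, step, make):
    """Return node[step], creating it with make() if it does not exist yet."""
    if isinstance(node, list):
        if step == len(node):                 # keys arrive in index order
            node.append(make())
        return node[step]
    if step not in node:
        node[step] = make()
    return node[step]


def rebuild(pairs):
    (_, mapping), *rows = pairs               # the first key in the range is .../0
    doc = {}
    for key, value in rows:
        path = to_steps(key[2:], mapping)
        node = doc
        for step, nxt in zip(path, path[1:]):
            node = child_of(node, step, list if isinstance(nxt, int) else dict)
        child_of(node, path[-1], lambda: value)
    return doc


pairs = get_range(STORE, ("docs", "d1", 0), ("docs", "d1", 0xFF))
print(rebuild(pairs))
```

The upper bound `0xFF` is taken from the range query as written; with integer path elements it only covers field ids below 255, so a real encoding would need a bound that sorts after every possible id.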
>>>> The downside of this approach is we wouldn't be able to ensure the same 
>>>> order of keys in the JSON object. Currently the `jiffy` JSON encoder 
>>>> respects order of keys.
>>>> ```
>>>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
>>>> <<"{\"bbb\":1,\"aaa\":12}">>
>>>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
>>>> <<"{\"aaa\":12,\"bbb\":1}">>
>>>> ```
>>>> 
>>>> Best regards,
>>>> iilyak
>>>> 
>>>>> On 2019/01/30 13:02:57, Ilya Khlopotov <iil...@apache.org> wrote: 
>>>>> As you might already know, FoundationDB has a number of limitations 
>>>>> which influence the way we might store JSON documents. The limitations 
>>>>> are:
>>>>> 
>>>>> | limitation            | recommended value | recommended max | absolute max |
>>>>> |-----------------------|------------------:|----------------:|-------------:|
>>>>> | transaction duration  |                   |                 |        5 sec |
>>>>> | transaction data size |                   |                 |        10 MB |
>>>>> | key size              |          32 bytes |            1 kB |        10 kB |
>>>>> | value size            |                   |           10 kB |       100 kB |
>>>>> 
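As a small illustration of how these limits might be enforced on the write path, here is a sketch of a size check. The constant names and the `check_pair` helper are hypothetical, and reading "kB" as 1000 bytes is an assumption of this sketch.

```python
# Limits in bytes, transcribed from the table above.
KEY_RECOMMENDED_MAX = 1_000
KEY_ABSOLUTE_MAX = 10_000
VALUE_RECOMMENDED_MAX = 10_000
VALUE_ABSOLUTE_MAX = 100_000


def check_pair(key: bytes, value: bytes) -> list:
    """Return warnings for a key/value pair; raise if it can never be stored."""
    if len(key) > KEY_ABSOLUTE_MAX:
        raise ValueError("key exceeds absolute max")
    if len(value) > VALUE_ABSOLUTE_MAX:
        raise ValueError("value exceeds absolute max")
    warnings = []
    if len(key) > KEY_RECOMMENDED_MAX:
        warnings.append("key exceeds recommended max")
    if len(value) > VALUE_RECOMMENDED_MAX:
        warnings.append("value exceeds recommended max")
    return warnings


print(check_pair(b"k", b"v"))
print(check_pair(b"k" * 2_000, b"v" * 20_000))
```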
>>>>> In order to fit the JSON document into 100 kB we would have to partition 
>>>>> it in some way. There are three ways of partitioning the document:
>>>>> 1. store multiple binary blobs (parts) in different keys
>>>>> 2. flatten the JSON structure and store every path leading to a scalar value 
>>>>> under its own key
>>>>> 3. measure the size of different branches of a tree representing the JSON 
>>>>> document (while we parse) and use another key for the branch when we 
>>>>> are about to exceed the limit
>>>>> 
>>>>> - The first approach is the simplest but it wouldn't allow us to access 
>>>>> parts of the document.
>>>>> - The downsides of the second approach are:
>>>>>   - the flattened JSON structure would have long paths, which means longer keys
>>>>>   - a scalar value cannot be more than 100 kB (unless we split it as well)
>>>>> - The third approach falls short in cases when the structure of the document 
>>>>> doesn't allow a clean cut of branches:
>>>>>   - complex rules are needed to handle all corner cases
>>>>> The goals of this thread are:
>>>>> - to collect ideas on how to encode and store the JSON document
>>>>> - to comment on the collected ideas
>>>>> 
>>>>> Non-goals:
>>>>> - the storage of metadata for the document, which will be discussed elsewhere:
>>>>>   - tombstones
>>>>>   - edit conflicts
>>>>>   - revisions
>>>>> 
>>>>> Best regards,
>>>>> iilyak
>>>>> 
>>> 
>>> -- 
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>> 
