> I was kind of expecting to see some magic value indicating that the
> subsequent set of keys with the same prefix are all elements of a “multi-part
> object”

I missed this aspect. This is easy to solve (as you've mentioned) by using either a special character or reserved value in the mapping table.
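
A minimal sketch of the reserved-value option, in Python, with keys modelled as integer tuples instead of real FoundationDB calls. The id `2` for `MULTIPART` and the helper names `split_value`, `multipart_base` and `join_values` are assumptions made for illustration; the proposal itself only reserves 0 and 1.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, ...]

ARRAY = 1              # reserved in the proposal: precedes an array index
MULTIPART = 2          # hypothetical reserved id: precedes a part index
VALUE_LIMIT = 100_000  # FoundationDB's absolute value size limit, in bytes


def split_value(path: Key, value: bytes, limit: int = VALUE_LIMIT) -> Dict[Key, bytes]:
    """Store small values under `path`; chunk large ones under path / MULTIPART / idx."""
    if len(value) <= limit:
        return {path: value}
    return {
        path + (MULTIPART, idx): value[start:start + limit]
        for idx, start in enumerate(range(0, len(value), limit))
    }


def multipart_base(key: Key) -> Optional[Key]:
    """Return the JSON path of a chunk key, or None if the key is not a chunk."""
    i = 0
    while i < len(key):
        if key[i] == ARRAY:
            i += 2            # skip the marker and the array index that follows it
        elif key[i] == MULTIPART:
            return key[:i]    # everything before the marker is the JSON path
        else:
            i += 1
    return None


def join_values(pairs: Dict[Key, bytes]) -> Dict[Key, bytes]:
    """Reassemble chunked values; sorted keys yield the parts in index order."""
    out: Dict[Key, bytes] = {}
    for key in sorted(pairs):
        base = multipart_base(key)
        if base is None:
            out[key] = pairs[key]
        else:
            out[base] = out.get(base, b"") + pairs[key]
    return out
```

Because the marker is never handed out as a field id, a reader can tell `{PART_IDX}` apart from `{JSON_PATH}` elements without a separate `kMULTIPART` key-value pair. An array index that happens to equal 2 is not confused with the marker, since indices are only read right after the `1` marker.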
On 2019/01/30 16:58:29, Adam Kocoloski <kocol...@apache.org> wrote:
> Jan, I don’t think it does have that "fun property #2", as the mapping is
> created separately for each document. In this proposal the field name “foo”
> could map to 2 in one document and 42 in another.
>
> Thanks for the proposal Ilya. Personally I wonder if the 10KB limit on field
> paths is anything more than a theoretical concern. It’s hard for me to
> imagine a useful schema that would get anywhere near that deep, but maybe I’m
> insufficiently creative :) There’s certainly a storage overhead from
> repeating the upper portion of a path over and over again, but that’s also
> something the storage engine can optimize away through prefix elision. The
> current production storage engine in FoundationDB does not do this elision,
> but the new one in development does.
>
> The value size limit is probably not so theoretical. I think as a project we
> could choose to impose a 100KB size limit on scalar values - a user who had a
> string longer than 100KB could chunk it up into an array of strings pretty
> easily to work around that limit. But let’s say we don’t want to impose that
> limit. In your design, how do I distinguish {PART_IDX} from the elements of
> the {JSON_PATH}? I was kind of expecting to see some magic value indicating
> that the subsequent set of keys with the same prefix are all elements of a
> “multi-part object”:
>
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} = kMULTIPART
> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX} = “First 100 KB …”
> ...
>
> You might have figured out something more efficient that saves a KV here but
> I can’t quite grok it.
>
> Cheers, Adam
>
> > On Jan 30, 2019, at 8:24 AM, Jan Lehnardt <j...@apache.org> wrote:
> >
> >> On 30. Jan 2019, at 14:22, Jan Lehnardt <j...@apache.org> wrote:
> >>
> >> Thanks Ilya for getting this started!
> >>
> >> Two quick notes on this one:
> >>
> >> 1. note that JSON does not guarantee object key order and that CouchDB has
> >> never guaranteed it either, and with say emit(doc.foo, doc.bar), if either
> >> emit() parameter was an object, the undefined-sort-order of SpiderMonkey
> >> would mix things up. While worth bringing up, this is not a BC break.
> >>
> >> 2. This would have the fun property of being able to rename a key inside
> >> all docs that have that key.
> >
> > …in one short operation.
> >
> > Best
> > Jan
> > --
> >>
> >> Best
> >> Jan
> >> --
> >>
> >>> On 30. Jan 2019, at 14:05, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>
> >>> # First proposal
> >>>
> >>> In order to overcome FoundationDB limitations on key size (10 kB) and
> >>> value size (100 kB) we could use the following approach.
> >>>
> >>> Below, the paths use slashes for illustration purposes only. We can
> >>> use nested subspaces, tuples, directories or something else.
> >>>
> >>> - Store documents in a subspace or directory (to keep the prefix for a
> >>> key short)
> >>> - When we store the document we would enumerate all field names (0 and 1
> >>> are reserved) and store the mapping table in the key which looks like:
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / 0
> >>> ```
> >>> - Flatten the JSON document (convert it into key value pairs where the
> >>> key is `JSON_PATH` and value is `SCALAR_VALUE`)
> >>> - Replace elements of JSON_PATH with integers from the mapping table we
> >>> constructed earlier
> >>> - When we have an array use `1 / {array_idx}`
> >>> - Store scalar values in the keys which look like the following (we use
> >>> `JSON_PATH` with integers).
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH}
> >>> ```
> >>> - If the scalar value exceeds 100kB we would split it and store every
> >>> part under a key constructed as:
> >>> ```
> >>> {DB_DOCS_NS} / {DOC_KEY} / {JSON_PATH} / {PART_IDX}
> >>> ```
> >>>
> >>> Since all parts of the documents are stored under a common `{DB_DOCS_NS}
> >>> / {DOC_KEY}` they will be stored on the same server most of the time. The
> >>> document can be retrieved by using a range query
> >>> (`txn.get_range("{DB_DOCS_NS} / {DOC_KEY} / 0", "{DB_DOCS_NS} / {DOC_KEY}
> >>> / 0xFF")`). We can reconstruct the document since the mapping is returned
> >>> as well.
> >>>
> >>> The downside of this approach is we wouldn't be able to ensure the same
> >>> order of keys in the JSON object. Currently the `jiffy` JSON encoder
> >>> respects order of keys.
> >>> ```
> >>> 4> jiffy:encode({[{bbb, 1}, {aaa, 12}]}).
> >>> <<"{\"bbb\":1,\"aaa\":12}">>
> >>> 5> jiffy:encode({[{aaa, 12}, {bbb, 1}]}).
> >>> <<"{\"aaa\":12,\"bbb\":1}">>
> >>> ```
> >>>
> >>> Best regards,
> >>> iilyak
> >>>
> >>> On 2019/01/30 13:02:57, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>> As you might already know, FoundationDB has a number of limitations
> >>>> which influence the way we might store JSON documents. The limitations
> >>>> are:
> >>>>
> >>>> | limitation            | recommended value | recommended max | absolute max |
> >>>> |-----------------------|------------------:|----------------:|-------------:|
> >>>> | transaction duration  |                   |                 |        5 sec |
> >>>> | transaction data size |                   |                 |        10 MB |
> >>>> | key size              |          32 bytes |            1 kB |        10 kB |
> >>>> | value size            |                   |           10 kB |       100 kB |
> >>>>
> >>>> In order to fit the JSON document into 100kB we would have to partition
> >>>> it in some way. There are three ways of partitioning the document:
> >>>> 1. store multiple binary blobs (parts) in different keys
> >>>> 2. flatten JSON structure and store every path leading to a scalar value
> >>>> under its own key
> >>>> 3. measure the size of different branches of a tree representing the
> >>>> JSON document (while we parse) and use another key for the branch when
> >>>> we are about to exceed the limit
> >>>>
> >>>> - The first approach is the simplest but it wouldn't allow us to access
> >>>> parts of the document.
> >>>> - The downsides of the second approach are:
> >>>>   - flattened JSON structure would have long paths which means longer keys
> >>>>   - the scalar value cannot be more than 100kB (unless we split it as well)
> >>>> - The third approach falls short in cases when the structure of the
> >>>> document doesn't allow branches to be cut off cleanly:
> >>>>   - complex rules to handle all corner cases
> >>>>
> >>>> The goals of this thread are:
> >>>> - to collect ideas on how to encode and store the JSON document
> >>>> - to comment on the collected ideas
> >>>>
> >>>> Non goals:
> >>>> - the storage of metadata for the document would be discussed elsewhere
> >>>>   - tombstones
> >>>>   - edit conflicts
> >>>>   - revisions
> >>>>
> >>>> Best regards,
> >>>> iilyak
> >>>>
> >>
> >> --
> >> Professional Support for Apache CouchDB:
> >> https://neighbourhood.ie/couchdb-support/
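
The flatten-and-map steps of the quoted first proposal can be sketched roughly as follows. This is only an illustration in Python under some assumptions: keys are modelled as integer tuples in a plain dict with no FoundationDB calls, the document is assumed to be a JSON object at the top level, field ids are handed out from 2 upwards in encounter order, and the names `flatten`, `unflatten` and `_listify` are made up for the example. Empty objects and empty arrays are not handled, since they contain no scalar to store.

```python
from typing import Any, Dict, Tuple

Key = Tuple[int, ...]

MAPPING = 0  # reserved: key holding the per-document mapping table
ARRAY = 1    # reserved: marker that precedes an array index


def flatten(doc: Dict[str, Any]) -> Dict[Key, Any]:
    """Turn a JSON object into {integer path: scalar} plus a mapping table at (0,)."""
    ids: Dict[str, int] = {}
    pairs: Dict[Key, Any] = {}

    def walk(node: Any, path: Key) -> None:
        if isinstance(node, dict):
            for name, child in node.items():
                # Reuse the id if the field name was seen before in this document.
                field_id = ids.setdefault(name, len(ids) + 2)
                walk(child, path + (field_id,))
        elif isinstance(node, list):
            for idx, child in enumerate(node):
                walk(child, path + (ARRAY, idx))
        else:
            pairs[path] = node  # scalar value stored under its full path

    walk(doc, ())
    pairs[(MAPPING,)] = {field_id: name for name, field_id in ids.items()}
    return pairs


def _listify(node: Any) -> Any:
    """Convert intermediate dicts keyed by array index back into lists."""
    if not isinstance(node, dict):
        return node
    if node and all(isinstance(k, int) for k in node):
        return [_listify(node[i]) for i in sorted(node)]
    return {k: _listify(v) for k, v in node.items()}


def unflatten(pairs: Dict[Key, Any]) -> Dict[str, Any]:
    """Rebuild the document from the flattened pairs and the mapping table."""
    names = pairs[(MAPPING,)]
    tree: Dict[Any, Any] = {}
    for key, value in pairs.items():
        if key == (MAPPING,):
            continue
        steps, i = [], 0
        while i < len(key):
            if key[i] == ARRAY:
                steps.append(key[i + 1])     # int step: array index
                i += 2
            else:
                steps.append(names[key[i]])  # str step: field name
                i += 1
        node = tree
        for step in steps[:-1]:
            node = node.setdefault(step, {})
        node[steps[-1]] = value
    return _listify(tree)
```

For example, `flatten({"foo": {"bar": [1, "x"]}})` stores the scalar `1` under the path `(2, 3, 1, 0)`: id 2 for `foo`, id 3 for `bar`, then the array marker and index 0. Because the ids are assigned per document, the same field name can map to different integers in different documents, as Adam points out.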