On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:

> Hi Michael,
>
> > The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>
> Yes that’s definitely one way to address storage of edit conflicts. I
> think there are other, more compact representations that we can explore if
> we have this “exploded” data model where each scalar value maps to an
> individual KV pair.


I agree. As I mentioned on the original thread, I see a scheme that
handles both conflicts and revisions, where you only have to store the
most recent change to a field.  Like you suggested, multiple revisions can
share a key.  In my mind's eye, that pulls the conflicts/revisions
discussion together with the working-within-the-limits discussion, because
it seems to me they are all intrinsically related as a "feature".
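
To make that concrete, here's a rough sketch of the field-sharing idea in
Python. Everything here is invented for illustration: a plain dict stands
in for FDB, the key layout is made up, and revisions are simplified to
linear integers rather than a real rev tree.

    store = {}  # (doc_id, field_path, rev) -> scalar value

    def write_rev(doc_id, rev, changed_fields):
        # Only the fields that actually changed in this revision are written.
        for path, value in changed_fields.items():
            store[(doc_id, path, rev)] = value

    def read_rev(doc_id, rev, all_paths):
        # For each field, take the newest value at or before the requested
        # rev; unchanged fields are shared with earlier revisions for free.
        doc = {}
        for path in all_paths:
            revs = [r for (d, p, r) in store
                    if d == doc_id and p == path and r <= rev]
            if revs:
                doc[path] = store[(doc_id, path, max(revs))]
        return doc

    write_rev("doc1", 1, {"name": "eve", "age": 30})
    write_rev("doc1", 2, {"age": 31})  # rev 2 shares "name" with rev 1
    assert read_rev("doc1", 2, ["name", "age"]) == {"name": "eve", "age": 31}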

Saying 'we'll break documents up into roughly 80k segments', then trying
to overlay some kind of field-sharing scheme for revisions/conflicts on
top, doesn't seem like it will work.

I probably should have left out the trivial-fix proposal, as I don't think
it's a feasible solution to actually use.

My comment is more that I do not see how this thread can escape including
how to store/retrieve conflicts/revisions.

For instance, the 'doc as individual fields' proposal lends itself to value
sharing across multiple documents (and I don't just mean revisions of the
same doc; I mean the same key/value instance could be shared by every
document).
However, that's not really relevant if we're not considering the amount of
shared information across documents in the storage scheme.

Simply storing documents in <100k segments (perhaps in some kind of
compressed binary representation) to deal with that FDB limit seems fine.
The only reason to consider doing something else is its impact on
indexing, searches, reduce functions, revisions, on-disk size,
etc.
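
For reference, that baseline is only a few lines. A sketch, with zlib
standing in for 'some kind of compressed binary representation' and an
invented key layout:

    import json, zlib

    SEGMENT_SIZE = 80_000  # stay comfortably below FDB's 100k value limit

    def to_segments(doc):
        blob = zlib.compress(json.dumps(doc).encode("utf-8"))
        return [blob[i:i + SEGMENT_SIZE]
                for i in range(0, len(blob), SEGMENT_SIZE)]

    def from_segments(segments):
        return json.loads(zlib.decompress(b"".join(segments)).decode("utf-8"))

    # Segment i would live at something like ("docs", doc_id, i), so the
    # whole document comes back with a single range read.
    segs = to_segments({"_id": "doc1", "body": "x" * 500_000})
    assert from_segments(segs)["_id"] == "doc1"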



> > I'm assuming the process will flatten the key paths of the document into
> an array and then request the value of each key as multiple parallel
> queries against FDB at once
>
> Ah, I think this is not one of Ilya’s assumptions. He’s trying to design a
> model which allows the retrieval of a document with a single range read,
> which is a good goal in my opinion.
>

I am not sure I agree.

Think of BitTorrent: a single range read should pull back the structure of
the document (the pieces to fetch), but not necessarily the whole document.

What if you already have a bunch of pieces in common with other documents
locally (a repeated header, footer, or type, for example), and you only
need to fetch the few pieces of data you don't already have?
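
A sketch of that read path (dicts stand in for FDB and for the app-local
cache; all names are made up):

    import hashlib

    pieces = {}       # piece_hash -> bytes, shared across documents
    manifests = {}    # doc_id -> ordered list of piece hashes
    local_cache = {}  # app-local cache of pieces we already hold

    def fetch_doc(doc_id):
        piece_ids = manifests[doc_id]  # one small "structure" read
        for p in piece_ids:
            if p not in local_cache:   # fetch only the pieces we lack
                local_cache[p] = pieces[p]
        return b"".join(local_cache[p] for p in piece_ids)

    h = hashlib.sha256(b"shared header").digest()
    pieces[h] = b"shared header"
    manifests["doc1"] = [h]
    assert fetch_doc("doc1") == b"shared header"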

The real goal I see for Couch is to treat your document set like the
collection of structured information that it is: in some respects, an
extension of your application's heap space for structured objects, with
efficient queries over that collection to get back subsets of the data.

Otherwise it seems more like a slightly upgraded file system plus a fancy
grep/find-like feature...

The best way I see to unlock more features/power is to move towards a more
granular and efficient way to store and retrieve the scalar values...



For example, here's a crazy thought:
Map every distinct occurrence of a key/value instance through a crypto
hash function to get a set of hashes.
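
The hashing step itself is trivial and needs no FDB round trips; for
example (SHA-256 and the flattened-document shape are just assumptions):

    import hashlib, json

    def kv_hashes(flat_doc):
        # One hash per key/value instance, computable entirely client-side.
        out = []
        for path, value in flat_doc.items():
            blob = json.dumps([path, value], sort_keys=True).encode("utf-8")
            out.append(hashlib.sha256(blob).digest())
        return out

    hashes = kv_hashes({"name": "eve", "address.city": "Boston"})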

These can be precomputed by Couch without any lookups in FDB.  But they
will be spread all over kingdom come in FDB and will not lend themselves
well to range searches.

So what you do is index them by how frequently they occur in the same set.
In essence, you 'bucket' them statistically, and that bucket id becomes a
key prefix.  A crypto hash value can be copied into more than one bucket.
The {bucket_id}/{cryptohash} pair becomes a {val_id}.

When writing a document, Couch submits the list/array of cryptohash values
it computed to FDB and gets back the corresponding {val_id}s (the ids with
the bucket prefixed).  This can get somewhat expensive if there are always
a lot of app-local cache misses.
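
Very roughly, the write path might look like this (a dict stands in for
FDB, and the bucket-assignment policy is a one-line placeholder, not the
statistical co-occurrence scheme described above):

    import hashlib

    buckets = {}  # cryptohash -> bucket_id, assigned on the FDB side

    def assign_val_ids(hashes):
        # Conceptually one batched FDB round trip: hashes in, val_ids out.
        val_ids = []
        for h in hashes:
            if h not in buckets:
                buckets[h] = h[0]  # placeholder policy: first byte as bucket
            val_ids.append((buckets[h], h))  # {bucket_id}/{cryptohash}
        return val_ids

    hashes = [hashlib.sha256(s).digest()
              for s in (b'"name":"eve"', b'"age":31')]
    val_ids = assign_val_ids(hashes)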


A document's value is then a series of {val_id} arrays, up to 100k per
segment.

When retrieving a document, you get the val_ids, find the distinct buckets
and min/max entries for this doc, and then query each bucket in parallel
while reconstructing the document.

The values returned from the bucket queries are the key/value strings
required to reassemble the document.
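
And the corresponding read path as a sketch (again a dict for FDB; the
sequential per-bucket scan stands in for bounded range reads a real client
would issue in parallel):

    values = {}  # (bucket_id, cryptohash) -> key/value string

    def read_doc(val_ids):
        by_bucket = {}
        for bucket_id, h in val_ids:
            by_bucket.setdefault(bucket_id, []).append(h)
        fetched = {}
        # One bounded query per bucket, limited by this doc's min/max hashes.
        for bucket_id, hs in by_bucket.items():
            lo, hi = min(hs), max(hs)
            for (b, h), v in values.items():
                if b == bucket_id and lo <= h <= hi:
                    fetched[(b, h)] = v
        return [fetched[v] for v in val_ids]  # reassemble in original order

    values[(1, b"aa")] = '"name":"eve"'
    values[(1, b"ab")] = '"age":31'
    assert read_doc([(1, b"aa"), (1, b"ab")]) == ['"name":"eve"', '"age":31']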


----------
I put this forward primarily to highlight that trying to map the storage
representation of documents onto FDB keys in a straightforward way, simply
to reduce query count, might not be the most performance-oriented approach.

I'd much prefer a storage approach that reduces data duplication and
enables fast sub-document queries.


This clearly falls in the realm of what people want the 'use case' of Couch
to be/become.  By giving Couch more access to sub-document queries, I could
eventually see queries as complicated as GraphQL being submitted to Couch
and pulling back ad-hoc aggregated data across multiple documents in a
single application-layer request.

Hehe - one way to look at the database of Couch documents is that they are
all conflict revisions of the single root empty document.  What I mean by
this is: consider thinking of the entire document store as one giant DAG of
key/value pairs, where even separate documents are still typically related
to each other.  For most applications there is a tremendous amount of data
redundancy between docs, and especially between revisions of those docs...



And all this is a long way of saying "I think there could be a lot of value
in assuming documents are 'assembled' from multiple queries to FDB, with
local caching, instead of simply retrieved"

Thanks, I hope I'm not the only outlier here thinking this way!?

Mike :-)
