On Wed, Dec 18, 2013 at 5:39 PM, Volker Mische <volker.mis...@gmail.com> wrote:

> On 12/03/2013 07:12 PM, Benoit Chesneau wrote:
> > On Tue, Dec 3, 2013 at 3:01 PM, Benjamin Young <byo...@bigbluehat.com> wrote:
> >
> >> Hi all,
> >>
> >> Recently the "doc._*" reservation has been causing me trouble when
> >> pulling in "arbitrary" JSON from various sources that also use the
> >> underscore prefixed names for things (HAL [1], vnd.error [2], other
> >> APIs). I've also hit the wall several times when trying to import
> >> filesystem contents (Sphinx, ghpages, and the like) that use _*
> >> prefixing for their "special folders."
> >>
> >> As such, I'd like to propose the following:
> >> 1. Begin storing new reserved terms in doc._.* (rather than doc._*).
> >>  - this gives developers one object to look into for the meta-data
> >>    about a doc
> >>  - you can see the scope creep of our current doc._* best in the
> >>    replicator status messages.
> >>     - doc._replication_* would become doc._.replication.*
> >> 2. Move "magic" API endpoints under a "/_/" term as well (for the sake
> >> of attachments).
> >>  - _design/doc would stay the same
> >>  - but the child endpoints would live under "_design/doc/_/*"
> >>     - _design/doc/_/view/by_date
> >>     - _design/doc/_/list/by_date/ul
> >>     - _design/doc/_/rewrite
> >>
> >> I realize these are extreme API shifts, and would need to wait for
> >> CouchDB 2.0.
> >>
> >> The first steps in this direction would be to put new reserved word
> >> keys into a "doc._.*" namespace going forward. Closer to the "cut over"
> >> for 2.0, duplicates of the existing keys (doc._id, doc._rev, especially)
> >> could also live at their new underscore prefixed names (doc._.id,
> >> doc._.rev), which would give devs a chance to migrate code and content.
> >>
> >> Doing this would:
> >> 1. Give us "limitless" space to add content.
> >> 2. Encourage a namespacing pattern for things like doc._.replication.*
> >> or other logically grouped content.
> >> 3. Free up CouchDB to accept a far broader range of content and remove
> >> the "hey, you can't put that there! I was here first!" errors. :)
> >>
> >> Thanks for considering this,
> >> Benjamin
> >>
> >> [1] http://stateless.co/hal_specification.html
> >> [2] https://github.com/blongden/vnd.error
> >>
> >
> > I don't see why couchdb should adapt itself to newer things that didn't
> > take care of an older API when doing their stuff but that's probably
> > another concern ;)
> >
> > I would find a "/_/" in the URL rather ugly and not needed in that case.
> > Same for having a _ in a doc. Also, it doesn't make much sense. Why do
> > you want to change the HTTP API at that level?
> >
> > Another way to do it, and probably a more RESTish one, would be moving
> > all CouchDB resources into their own namespace, say `couchdb/`. Anything
> > under the `couchdb/` resource would then be related to CouchDB.
> >
> > Next is the prefix "_" in the doc. It's actually reserved because
> > one day we may add other metadata, which is fine, but it raises
> > the issue you have.
> >
> > If I summarise the discussion here and the preceding discussions, there
> > are different schools of thought:
> >
> > - remove the metadata from the doc and put them in headers or aside. I
> > quite like the first solution, though it may be a problem behind some
> > proxies, or with the header length (especially for json values). Also
> > headers are supposed to be in latin1 in a lot of clients...
> > - put the metadata in their own namespace which is what you propose.
> >
> > I dislike the last solution, mostly because it would force clients to
> > wait for this namespace to read the metadata while parsing the JSON
> > (which could happen while streaming it). Instead I would prefer to keep
> > the metadata at the first level and do the reverse: put the data in its
> > own namespace, say `_data`. This allows any client to ignore this layer
> > if needed while parsing the JSON and get it directly (without parsing
> > it). The metadata should be the first-class citizen, imo. Optionally we
> > could add some new parameters to the doc API allowing someone to fetch
> > only the metadata, etc. CouchDB could also parse the incoming doc, stop
> > parsing the JSON when it sees this property, and store it directly. It
> > also follows the logic of attachments somehow. Another thing that could
> > be done at the API level is having something like `/db/docid/_data`,
> > which would allow you to retrieve only the data instead of using a show
> > function.
> >
> > What do you think?
> >
> > - benoit
>
> Hi all,
>
> I've been talking with Benoit about this at the CouchHack. I think his
> proposal makes a lot of sense. Let's take the separation of meta and the
> document body (as I proposed) together with what Benoit said.
>
> When storing the actual data in a top-level property called "_data", you
> could easily extract the meta information, without parsing the body at
> all. You just need to parse all the top level properties (which you need
> to do anyway as JSON doesn't have any distinct sorting).
>
> Having this could be a great first step towards making meta and document
> body separation easier to implement.
>
> In a next step you could then e.g. provide an API where you just send
> the document body, with the meta as headers.
>
> Cheers,
>   Volker
>
>

I recently started a new project where having the metadata and the content
separated would make a lot of sense. Here is a quick summary, in no
particular order, of my thinking about it.

- With our current concurrency model, it makes sense to have the metadata
come with the document. Having it come in a separate commit/doc would
create a lot of problems in a distributed environment (what happens when a
doc is edited in 2 places and the metadata is updated separately?). Our
revision model is here to solve such things.

- It would be interesting to let users set their own metadata on a
document. We could imagine someone adding timestamps, someone else adding
authentication info, etc. Some metadata could also be hidden from the user
who replicates or fetches the doc. Metadata should really be thought of as
a description of the doc and of the way it will be shared/stored, nothing
more, i.e. mainly used for internal purposes, and some of it could be local
to a node. It also answers the original problem that started this thread:
we could design some entry points in the API that only return the body of
the doc (without its metadata) so that clients would be happy with it, or
something like that.

- I like attachments. Transforming CouchDB into another object database
(aka blob store) is neither really interesting nor innovative. In the end,
most users of blob stores also use a database to index the objects, whereas
in an attachment model we attach a blob to its structured description in
the doc. That description can then be indexed using views. I think in the
future CouchDB should consider attachments as links attached to a doc. Such
a link could be internal, as it is now, but also external. The remote link
would be handled transparently for whoever replicates, but in the end we
could possibly attach a blob from an external source. We could also link
another doc... (digression spotted)

- About metadata sent outside the JSON or as a header, I have a preference
for having it in the JSON sent to CouchDB, mainly because we could then
support protocols other than HTTP without having to support different
ways of reading the metadata that comes with the document. Someone who
wants to just use TCP to pass the doc could then handle only the transport
logic and give CouchDB the JSON, which would then be indexed. In the other
cases the transport would also need to manage how it gets the metadata.

- If we have the metadata in the JSON, as VMX already said, it's much
more efficient to have the metadata at the first level and make the content
available in a `_data` (or `_body`) property. We could then parse the JSON
to fetch all the metadata and skip the `_data` member, which would then be
stored on disk. Doing it the other way (having the metadata in a `_meta`
property) wouldn't be efficient at all due to the nature of JSON: there is
no guarantee about the order of the properties. Also, the content will
generally be bigger than the metadata (most docs will only have `_id` and
`_rev`).
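
To make the last point concrete, here is a minimal Python sketch (not
CouchDB code) of the proposed layout: metadata at the first level, content
under `_data`. The function name `split_envelope` and the sample document
are hypothetical. Note that `json.loads` decodes the whole document; a real
implementation would presumably use a streaming parser so that the `_data`
value can be skipped without being decoded.

```python
import json


def split_envelope(raw):
    """Split a document into (metadata, body).

    Hypothetical helper: every first-level member except `_data` is
    treated as metadata, and the `_data` member is the user content.
    """
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        raise ValueError("a document must be a JSON object")
    meta = {k: v for k, v in doc.items() if k != "_data"}
    body = doc.get("_data")
    return meta, body


# The body can freely use underscore-prefixed names (e.g. HAL's `_links`)
# because it lives in its own namespace.
raw = (
    '{"_id": "report-1", "_rev": "1-abc", '
    '"_data": {"_links": {"self": "/reports/1"}, "title": "Q4"}}'
)
meta, body = split_envelope(raw)
```

With this layout the underscore-prefixed names inside the body never
collide with the reserved first-level names.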

After looking at the code, I don't think we need a lot of changes to
support a system that uses a JSON which separates the metadata from the
content of the doc. If we are OK with that, I could quickly provide a patch
which introduces that change. For compatibility, it could also parse the
full doc when no `_data` (or `_body`) member is found, and make this
transparent for the end user by reading an API version that could come in
the headers.
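
As a rough illustration of that compatibility path, the following Python
sketch (again, not the actual patch) accepts both layouts on input and picks
the output layout from an API version. The names `normalize` and `render`,
and the idea that version 2 selects the new layout, are assumptions made for
the example only; how the version would be carried in the headers is left
open here.

```python
def normalize(doc):
    """Return (metadata, body) for either document layout.

    Assumption: when no `_data` member is found, the full doc is parsed
    the current way, i.e. underscore-prefixed first-level members are
    metadata and everything else is the body.
    """
    if "_data" in doc:
        meta = {k: v for k, v in doc.items() if k != "_data"}
        return meta, doc["_data"]
    meta = {k: v for k, v in doc.items() if k.startswith("_")}
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    return meta, body


def render(meta, body, api_version):
    """Build the response doc for the API version the client asked for.

    Assumption: version 2 and above get the `_data` layout, older
    clients get the flat layout they already understand.
    """
    if api_version >= 2:
        return {**meta, "_data": body}
    return {**meta, **body}


legacy = {"_id": "a", "_rev": "1-x", "title": "t"}
meta, body = normalize(legacy)
new_style = render(meta, body, api_version=2)
old_style = render(meta, body, api_version=1)
```

Both layouts normalize to the same internal pair, so the storage side would
not need to know which one the client sent.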

Anyway, these are just my 2 cents on the topic. I would be more than happy
to discuss it further so we could introduce such changes rapidly in
our API (even before any merge, possibly).

- benoit
