Re: Document storage

Ben McCann Fri, 30 Mar 2012 10:34:09 -0700

>
> If you don't need selected updates and having something as compact as
> possible on disk makes an important difference for you, sure, do use blobs.
> The only argument is that you can already do that without any change to
> the core.



The thing that we can't do today without changes to the core is index on
subparts of a document format such as Protobuf or JSON. If Cassandra
understood one of these formats, it could remove the need to manage an
index manually.
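
To make the manual bookkeeping concrete, here is a minimal sketch in Python. Plain dicts stand in for two column families; the names `documents`, `index_by_city`, `put_document`, and `find_by_city` are hypothetical and not part of any Cassandra API. It only illustrates that, with an opaque blob, the application has to keep the index in step with every write itself.

```python
import json

# Hypothetical in-memory stand-ins for two column families (assumption,
# not a Cassandra API): one holds blobs, one holds a hand-built index.
documents = {}       # row key -> serialized JSON blob
index_by_city = {}   # value of the "city" field -> set of row keys


def put_document(key, doc):
    """Store doc as an opaque blob and maintain the 'city' index by hand."""
    old_blob = documents.get(key)
    if old_blob is not None:
        # The store cannot see inside the blob, so the application must
        # deserialize the old version to remove the stale index entry.
        old_city = json.loads(old_blob).get("city")
        if old_city is not None:
            index_by_city.get(old_city, set()).discard(key)
    documents[key] = json.dumps(doc)
    city = doc.get("city")
    if city is not None:
        index_by_city.setdefault(city, set()).add(key)


def find_by_city(city):
    """Look up documents through the manually maintained index."""
    keys = sorted(index_by_city.get(city, ()))
    return [json.loads(documents[k]) for k in keys]
```

If the store understood the document format, the read-before-write and the second structure in `put_document` are the parts it could take over.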


On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:

> On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
> <daniel.double...@gmx.net> wrote:
> > But decomposing into columns will lead to more of that:
> >
> > - Total amount of serialized data is (in most cases a lot) larger than
> > protobuffed / compressed version
>
> At least with sstable compression, I would expect the difference to
> not be too big in practice.
>
> > - If you do selective updates, the document will be scattered over
> > multiple sstables. And if you do sliced reads, you can't optimize reads
> > the way you can with the single-column version, which when updated
> > automatically supersedes older versions, so most reads hit only one sstable
>
> But if you need to do selective updates, then a blob just doesn't work
> so that comparison is moot.
>
> Now I don't think anyone pretended that you should never use blobs
> (whether that's protobuffed, jsoned, ...). If you don't need selected
> updates and having something as compact as possible on disk makes an
> important difference for you, sure, do use blobs. The only argument is
> that you can already do that without any change to the core. What we
> are saying is that for the case where you care more about schema
> flexibility (being able to do selective updates, to index on some
> subpart, etc...) then we think that something like the map and list
> idea of CASSANDRA-3647 will probably be a more natural fit to the
> current CQL API.
>
> --
> Sylvain
>
> >
> > All these reads make up the hot dataset. If it fits in the page cache,
> > you're fine. If it doesn't, you need to buy more iron.
> >
> > Really could not resist because your statement seems to be contrary to
> > all our tests / learnings.
> >
> > Cheers,
> > Daniel
> >
> > From dev list:
> >
> > Re: Document storage
> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <d...@venarc.com> wrote:
> >>> I think this is a much better approach because that gives you the
> >>> ability to update or retrieve just parts of objects efficiently,
> >>> rather than making column values just blobs with a bunch of special
> >>> case logic to introspect them.  Which feels like a big step backwards
> >>> to me.
> >>
> >> Unless your access pattern involves reading/writing the whole document
> >> each time. In that case you're better off serializing the whole document
> >> and storing it in a column as a byte[] without incurring the overhead of
> >> column indexes. Right?
> >
> > Hmm, not sure what you're thinking of there.
> >
> > If you mean the "index" that's part of the row header for random
> > access within a row, then no, serializing to byte[] doesn't save you
> > anything.
> >
> > If you mean secondary indexes, don't declare any if you don't want any. :)
> >
> > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > than giving it named columns, but we're talking negligible compared to
> > the overhead of actually moving the data on or off disk in the first
> > place.  Not even close to being worth giving up being able to deal
> > with your data from standard tools like cqlsh, IMO.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
> >
>
