> > If you don't need selected updates and having something as compact as
> > possible on disk makes an important difference for you, sure, do use
> > blobs. The only argument is that you can already do that without any
> > change to the core.
The thing that we can't do today without changes to the core is index on
subparts of some document format like Protobuf/JSON/etc. If Cassandra were
to understand one of these formats, it could remove the need for manual
management of an index.

On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:

> On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
> <daniel.double...@gmx.net> wrote:
> > But decomposing into columns will lead to more of that:
> >
> > - Total amount of serialized data is (in most cases a lot) larger than
> > protobuffed / compressed version
>
> At least with sstable compression, I would expect the difference to
> not be too big in practice.
>
> > - If you do selective updates the document will be scattered over
> > multiple ssts plus if you do sliced reads you can't optimize reads as
> > opposed to the single column version that when updated is automatically
> > superseding older versions so most reads will hit only one sst
>
> But if you need to do selective updates, then a blob just doesn't work
> so that comparison is moot.
>
> Now I don't think anyone pretended that you should never use blobs
> (whether that's protobuffed, jsoned, ...). If you don't need selected
> updates and having something as compact as possible on disk makes an
> important difference for you, sure, do use blobs. The only argument is
> that you can already do that without any change to the core. What we
> are saying is that for the case where you care more about schema
> flexibility (being able to do selective updates, to index on some
> subpart, etc...) then we think that something like the map and list
> idea of CASSANDRA-3647 will probably be a more natural fit to the
> current CQL API.
>
> --
> Sylvain
>
> >
> > All these reads make the hot dataset. If it fits the page cache you're
> > fine. If it doesn't you need to buy more iron.
> >
> > Really could not resist because your statement seems to be contrary to
> > all our tests / learnings.
> >
> > Cheers,
> > Daniel
> >
> > From dev list:
> >
> > Re: Document storage
> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <d...@venarc.com>
> > wrote:
> >>> I think this is a much better approach because that gives you the
> >>> ability to update or retrieve just parts of objects efficiently,
> >>> rather than making column values just blobs with a bunch of special
> >>> case logic to introspect them. Which feels like a big step backwards
> >>> to me.
> >>
> >> Unless your access pattern involves reading/writing the whole document
> >> each time. In that case you're better off serializing the whole
> >> document and storing it in a column as a byte[] without incurring the
> >> overhead of column indexes. Right?
> >
> > Hmm, not sure what you're thinking of there.
> >
> > If you mean the "index" that's part of the row header for random
> > access within a row, then no, serializing to byte[] doesn't save you
> > anything.
> >
> > If you mean secondary indexes, don't declare any if you don't want any. :)
> >
> > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > than giving it named columns, but we're talking negligible compared to
> > the overhead of actually moving the data on or off disk in the first
> > place. Not even close to being worth giving up being able to deal
> > with your data from standard tools like cqlsh, IMO.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
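
To make the trade-off discussed in this thread concrete, here is a minimal
sketch in plain Python. It is only an illustration under stated assumptions:
it uses no Cassandra API, and the dicts blob_rows, column_rows and
manual_index, as well as the three helper functions, are hypothetical
stand-ins invented for this example. It shows why a "selective update" on a
blob becomes read, modify, rewrite, and why an index on a subpart of the blob
has to be maintained by the client.

```python
import json

# Hypothetical in-memory stand-ins for illustration; these are NOT Cassandra APIs.
blob_rows = {}     # row key -> whole document serialized into one opaque value
column_rows = {}   # row key -> {column name: value}, one column per field
manual_index = {}  # indexed field value -> set of row keys, kept up to date by the client


def blob_write(key, doc, indexed_field):
    """Store the whole document as one blob and maintain the index by hand."""
    old = blob_rows.get(key)
    if old is not None:
        # The store cannot see inside the blob, so the client must remove
        # the stale index entry itself.
        old_value = json.loads(old).get(indexed_field)
        manual_index.get(old_value, set()).discard(key)
    blob_rows[key] = json.dumps(doc)
    if indexed_field in doc:
        manual_index.setdefault(doc[indexed_field], set()).add(key)


def blob_update_field(key, field, value, indexed_field):
    """Changing one field of a blob means read, modify, and rewrite everything."""
    doc = json.loads(blob_rows[key])
    doc[field] = value
    blob_write(key, doc, indexed_field)


def column_update_field(key, field, value):
    """With one column per field, only the changed column is written."""
    column_rows.setdefault(key, {})[field] = value


# Usage: the same logical change expressed against both layouts.
blob_write("u1", {"name": "Daniel", "city": "Berlin"}, "city")
blob_update_field("u1", "city", "Paris", "city")
column_update_field("u1", "city", "Paris")
```

The sketch deliberately ignores disk layout, sstables and compression, which
are the points actually under debate above; it only shows the difference in
what the client has to do for each layout.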