Thanks for this reply. I'm wondering about the same issue: should I bucket
things into wide rows (say 10M columns each) or narrow ones (say 10K or
100K)? Of course it depends on my access patterns, right?
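
Just so we're talking about the same scheme: the bucketing I have in
mind is roughly the sketch below (plain Python; NUM_BUCKETS and the key
format are made up, not anything Cassandra prescribes):

    import hashlib

    NUM_BUCKETS = 100  # made-up knob: fewer buckets => wider rows

    def bucket_row_key(customer_id, item_id):
        # Hash the item id onto one of NUM_BUCKETS row keys for this
        # customer; the item id itself then becomes the column name
        # within that bucket row.
        h = int(hashlib.md5(item_id.encode('utf-8')).hexdigest(), 16)
        return '%s:bucket:%d' % (customer_id, h % NUM_BUCKETS)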

Does anyone know if a partial row cache would be a feasible feature to
implement? My use case is something like this:
I have rows with ~10MB / ~100K columns of data. I _typically_ slice from
oldest to newest on the row, and _typically_ only need the first 100
columns / ~10KB.
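
To make the access pattern concrete, this is roughly the read I do (a
sketch using the pycassa client; the keyspace/CF names are invented,
exact calls vary by pycassa version, and it assumes a time-ordered
comparator so a plain slice comes back oldest-first):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    # Fetch only the first 100 columns (oldest first under a
    # time-ordered comparator) instead of dragging the whole
    # 10MB row through the cache.
    first_slice = events.get('some-row-key', column_count=100)

A partial row cache that kept just that leading slice hot would cover
this pattern.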

If someone set out to implement a cache strategy to support this, would
they find it feasible, or difficult/impossible because of <some limitation
xyz>?

-JD



On Mon, Oct 11, 2010 at 8:08 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> 2010/10/11 Héctor Izquierdo Seliva <izquie...@strands.com>:
> > Hi everyone.
> >
> > I'm sure this question or something similar has come up before, but I
> > can't find a clear answer. I have to store an unknown number of items
> > in Cassandra, which can vary from a few hundred to a few million per
> > customer.
> >
> > I read that in Cassandra wide rows are better than a lot of rows, but
> > then I face two problems. First, column distribution. The only way I
> > can think of to distribute items among a given set of rows is hashing
> > the item id to a row id, and then using the item id as the column name.
> > That way I can distribute data evenly among a few rows, but if there
> > are only a few items it's equivalent to a row per item plus more
> > overhead, and if there are millions of items then the rows are too big
> > and I have to turn off the row cache. Does anybody know a way around this?
> >
> > The second issue is that in my benchmarks, once the data is mmapped, one
> > item per row performs faster than wide rows by a significant margin. Is
> > this how it is supposed to be?
> >
> > I can give additional data if needed. English is not my first language,
> > so I apologize beforehand if some of this doesn't make sense.
> >
> > Thanks for your time
> >
> >
> If you have wide rows, the RowCache is a problem. IMHO the RowCache is
> only viable in situations where you have a fixed amount of data and thus
> will get a high hit rate. I ran a large row cache for some time and
> found it unpredictable. It causes memory pressure on the JVM from moving
> things in and out of memory, and if the hit rate is low, pulling a key
> and all its columns in and out repeatedly ends up being counterproductive
> for disk utilization. I suggest the KeyCache in most situations (there
> is an open ticket for a fractional row cache).
>
> Another factor to consider: if you have many rows and many columns, you
> end up with large(r) indexes. In our case we have startup times slightly
> longer than we would like because sampling the indexes during startup is
> intensive. If I could do it all over again, I might serialize more into
> single columns rather than exploding data across multiple rows and
> columns. If you always need to look up the entire row, do not break it
> down by columns.
>
> On memory mapping: the dynamics differ depending on data size relative
> to memory size. A system with ~40GB of data and a 10GB index on a node
> with 32GB of RAM is not going to respond the same way as one with, say,
> 200GB of data and 25GB of indexes. It is also very workload dependent.
>
> Hope this helps,
> Edward
>
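
P.S. Edward, on serializing more into single columns: to check I
follow, you mean something like the sketch below? (Same hypothetical
pycassa setup as my earlier sketch; JSON is just one possible
serialization.)

    import json
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    items = pycassa.ColumnFamily(pool, 'Items')

    # One column holding the whole serialized record, instead of one
    # column per field exploded across rows and columns.
    record = {'name': 'widget', 'price': 999, 'tags': ['a', 'b']}
    items.insert('item-12345', {'blob': json.dumps(record)})

    # Reading it back is a single-column fetch plus a deserialize.
    row = items.get('item-12345', columns=['blob'])
    record_back = json.loads(row['blob'])

The trade-off, as you say, is that this only wins if you always read
the whole record.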
