Thanks for this reply. I'm wondering about the same issue: should I bucket things into wide rows (say 10M columns each) or narrow ones (say 10K or 100K)? Of course it depends on my access patterns, right?
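(For concreteness, the kind of bucketing I mean is roughly the following; a minimal sketch with made-up names, where numBuckets is the knob that trades row width against row count:)

// Minimal sketch: hash each item id into one of numBuckets bucket rows.
// The item id itself becomes the column name inside that row, so a small
// numBuckets gives wide rows and a large numBuckets gives narrow rows.
public final class BucketKey {

    public static String rowKeyFor(String itemId, int numBuckets) {
        // Mask off the sign bit so the modulo is always non-negative.
        int bucket = (itemId.hashCode() & 0x7fffffff) % numBuckets;
        return "items-" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(rowKeyFor("item-42", 16)); // e.g. "items-3"
    }
}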
Does anyone know if a partial row cache is a feasible feature to implement? My use case is something like: I have rows with 10MB / 100K columns of data. I _typically_ slice from oldest to newest on the row, and _typically_ only need the first 100 columns / 10KB, etc... If someone went to implement a cache strategy to support this, would they find it feasible, or difficult/impossible because of <some limitation xyz>?

-JD

On Mon, Oct 11, 2010 at 8:08 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> 2010/10/11 Héctor Izquierdo Seliva <izquie...@strands.com>:
> > Hi everyone.
> >
> > I'm sure this question or something similar has come up before, but I
> > can't find a clear answer. I have to store an unknown number of items
> > in Cassandra, which can vary from a few hundred to a few million per
> > customer.
> >
> > I read that in Cassandra wide rows are better than a lot of rows, but
> > then I face two problems. First, column distribution. The only way I
> > can think of to distribute items among a given set of rows is hashing
> > the item id to a row id, and then using the item id as the column name.
> > In this way, I can distribute data among a few rows evenly, but if
> > there are only a few items it's equivalent to a row per item plus more
> > overhead, and if there are millions of items then the rows are too big
> > and I have to turn off the row cache. Does anybody know a way around
> > this?
> >
> > The second issue is that in my benchmarks, once the data is mmapped,
> > one item per row performs faster than wide rows by a significant
> > margin. Is this how it is supposed to be?
> >
> > I can give additional data if needed. English is not my first language,
> > so I apologize beforehand if some of this doesn't make sense.
> >
> > Thanks for your time
> >
>
> If you have wide rows, RowCache is a problem. IMHO RowCache is only
> viable in situations where you have a fixed amount of data and thus
> will get a high hit rate. I was running a large row cache for some
> time and I found it unpredictable. It causes memory pressure on the
> JVM from moving things in and out of memory, and if the hit rate is
> low, taking a key and all its columns in and out repeatedly ends up
> being counterproductive for disk utilization. I suggest KeyCache in
> most situations. (There is a ticket open for a fractional row cache.)
>
> Another factor to consider is that if you have many rows and many
> columns, you end up with larger indexes. In our case we have start-up
> times slightly longer than we would like because the process of
> sampling indexes during start-up is intensive. If I could do it all
> over again, I might serialize more into single columns rather than
> exploding data across multiple rows and columns. If you always need to
> look up the entire row, do not break it down by columns.
>
> Memory mapping: there are different dynamics depending on data size
> relative to memory size. A system with ~40GB of data, a 10GB index,
> and 32GB of RAM per node is not going to respond the same way as one
> with, say, 200GB of data and 25GB of indexes. It is also very workload
> dependent.
>
> Hope this helps,
> Edward
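To make JD's access pattern concrete: the "first 100 columns" read maps onto an ordinary get_slice with a bounded count, so only that slice crosses the wire. A rough sketch against the 0.7-era Thrift API; the host, keyspace, column family, and row key below are all placeholders:

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class FirstColumnsSlice {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace"); // placeholder keyspace

        // Empty start/finish plus count=100 asks for only the first 100
        // columns of the row (oldest first under a time-ordered comparator),
        // so only that slice, not the whole 10MB row, is returned.
        SliceRange range = new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(range);

        List<ColumnOrSuperColumn> columns = client.get_slice(
                ByteBuffer.wrap("my-row-key".getBytes("UTF-8")),
                new ColumnParent("MyColumnFamily"), // placeholder column family
                predicate,
                ConsistencyLevel.ONE);

        System.out.println("fetched " + columns.size() + " columns");
        transport.close();
    }
}

Note this only bounds what each request reads and ships; it does not make the row cache partial, which is exactly the gap the fractional-row-cache ticket Edward mentions would close.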
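Edward's suggestion to prefer the key cache is a per-column-family setting. In the 0.6-era storage-conf.xml it looks roughly like this (the name and size are placeholders; the values can also be percentages):

<ColumnFamily Name="MyColumnFamily"
              KeysCached="200000"
              RowsCached="0"/>

KeysCached keeps only key index positions in memory, so wide rows cost nothing extra to cache, while RowsCached="0" avoids pulling whole multi-megabyte rows in and out of the JVM heap.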
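His "serialize more into single columns" point can also be sketched: when a row is always read whole, pack its fields into one column value instead of one column per field. A hypothetical example using plain java.io serialization; any format (JSON, Thrift, etc.) would do:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: one serialized blob per item, stored under a single
// column name, instead of exploding each field into its own column.
public class PackedItem {
    public static byte[] pack(String name, long updatedAt, int score) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(name);
        out.writeLong(updatedAt);
        out.writeInt(score);
        out.flush();
        return buf.toByteArray(); // becomes the value of one column
    }
}

Fewer, fatter columns mean smaller indexes and cheaper whole-row reads, at the cost of losing per-field slicing.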