No idea about a partial row cache, but I would start with fat rows in your use case. If you find that performance really is a problem, you could add a second "recent / oldest" CF that you maintain with the most recent entries and use the row cache there. Or add more nodes.
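
A rough sketch of what I mean (pycassa-style Python; the CF names, the 100-column cap, and the time-ordered column names are all assumptions for illustration, not anything you have to use):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
events = pycassa.ColumnFamily(pool, 'Events')        # fat rows, key cache only
recent = pycassa.ColumnFamily(pool, 'RecentEvents')  # small rows, row cache on

RECENT_LIMIT = 100  # newest columns kept in the small, row-cached CF

def write_event(row_key, col_name, value):
    # Write to the wide row as usual, and mirror the entry into the recent CF.
    events.insert(row_key, {col_name: value})
    recent.insert(row_key, {col_name: value})

def trim_recent(row_key, keep=RECENT_LIMIT):
    # Column names are assumed time-ordered, so reversed order = newest first;
    # anything past the first `keep` columns is stale and can be deleted.
    cols = recent.get(row_key, column_reversed=True, column_count=keep + 1000)
    stale = list(cols)[keep:]
    if stale:
        recent.remove(row_key, columns=stale)

def read_recent(row_key):
    # Hot reads hit the small row-cached CF instead of the fat row.
    return recent.get(row_key, column_count=RECENT_LIMIT)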
 

Aaron


On 12 Oct, 2010, at 10:08 AM, Jeremy Davis <jerdavis.cassan...@gmail.com> wrote:


Thanks for this reply. I'm wondering about the same issue... Should I bucket things into wide rows (say 10M columns each), or narrow ones (say 10K or 100K)?
Of course it depends on my access patterns, right?

Does anyone know if a partial row cache is a feasible feature to implement? My use case is something like:
I have rows with 10MB / 100K columns of data. I _typically_ slice from oldest to newest on the row, and _typically_ only need the first 100 columns / 10KB, etc...
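
Concretely, the typical read I have in mind is just a bounded slice from the head of the row, something like this (pycassa-style sketch; the CF and key names are made up):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
wide = pycassa.ColumnFamily(pool, 'WideRows')

# Typical read: oldest-to-newest slice, capped at the first 100 columns
# (~10KB of a 10MB row). A "partial row cache" would only need to keep this
# prefix of the row hot, not the whole thing.
first_columns = wide.get('some_row_key', column_count=100)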

If someone went to implement a cache strategy to support this, would they find it feasible, or difficult/impossible because of <some limitation xyz>?

-JD



On Mon, Oct 11, 2010 at 8:08 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
2010/10/11 Héctor Izquierdo Seliva <izquie...@strands.com>:

> Hi everyone.
>
> I'm sure this question, or a similar one, has come up before, but I can't find a
> clear answer. I have to store an unknown number of items in Cassandra,
> which can vary from a few hundred to a few million per customer.
>
> I read that in Cassandra wide rows are better than a lot of rows, but
> then I face two problems. First, column distribution. The only way I can
> think of to distribute items among a given set of rows is hashing the
> item id to a row id, and then using the item id as the column name. In
> this way I can distribute data among a few rows evenly, but if there
> are only a few items it's equivalent to a row per item plus more
> overhead, and if there are millions of items then the rows are too big,
> and I have to turn off the row cache. Does anybody know a way around this?
>
> The second issue is that in my benchmarks, once the data is mmapped, one
> item per row performs faster than wide rows by a significant margin. Is
> this how it is supposed to be?
>
> I can give additional data if needed. English is not my first language,
> so I apologize beforehand if some of this doesn't make sense.
>
> Thanks for your time
>
>
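
(As a concrete illustration of the hash-bucketing scheme described in the question above — a rough sketch only; the bucket count and key format are made-up assumptions:)

import zlib
import pycassa

NUM_BUCKETS = 16  # fixed number of rows per customer (assumption)

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
items = pycassa.ColumnFamily(pool, 'Items')

def bucket_key(customer_id, item_id):
    # Stable hash so the same item always lands in the same row.
    bucket = zlib.crc32(item_id.encode('utf-8')) % NUM_BUCKETS
    return '%s:%d' % (customer_id, bucket)

def put_item(customer_id, item_id, value):
    # Item id doubles as the column name inside the bucket row.
    items.insert(bucket_key(customer_id, item_id), {item_id: value})

def get_item(customer_id, item_id):
    return items.get(bucket_key(customer_id, item_id), columns=[item_id])[item_id]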
If you have wide rows, RowCache is a problem. IMHO RowCache is only
viable in situations where you have a fixed amount of data and will
therefore get a high hit rate. I was running a large row cache for some
time and found it unpredictable. It causes memory pressure on the
JVM from moving things in and out of memory, and if the hit rate is
low, pulling a key and all its columns in and out repeatedly ends up
being counterproductive for disk utilization. I suggest KeyCache in
most situations (there is a ticket open for a fractional row cache).

Another factor to consider is that if you have many rows and many columns
you end up with larger indexes. In our case, startup times are
slightly longer than we would like because the process of sampling
indexes during startup is intensive. If I could do it all over again
I might serialize more into single columns rather than exploding data
across multiple rows and columns. If you always need to look up the
entire row, do not break it down by columns.
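
To illustrate the difference (a rough sketch; the CF, keys, and the JSON choice are just examples, not what we actually run):

import json
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')

record = {'name': 'alice', 'email': 'alice@example.com', 'plan': 'gold'}

# Exploded: one column per field. Flexible for partial reads, but every
# field becomes a separate column the cluster has to index and sample.
users.insert('user:42', record)

# Serialized: the whole record in a single column. If you always read the
# entire row anyway, this keeps rows and indexes smaller.
users.insert('user:42:blob', {'data': json.dumps(record)})

row = users.get('user:42:blob')
profile = json.loads(row['data'])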

On memory mapping: there are different dynamics depending on data size
relative to memory size. A system with something like ~40GB of data,
a 10GB index, and 32GB of RAM per node is not going to respond the same
way as one with, say, 200GB of data and 25GB of indexes. It is also very
workload dependent.

Hope this helps,
Edward
