On Tue, Jan 28, 2014 at 7:57 AM, Robert Wille <rwi...@fold3.com> wrote:

> I have a dataset which is heavy on updates. The updates are actually
> performed by inserting new records and deleting the old ones the following
> day. Some records might be updated (replaced) a thousand times before they
> are finished.
>

Perhaps a log-structured database with immutable data files is not best
suited for this use case?

Are you deleting rows or columns each day?

> As I watch SSTables get created and compacted on my staging server (I
> haven't gone live with this yet), it appears that if I let the compactor do
> its default behavior, I'll probably end up consuming several times the
> amount of disk space as is actually required. I probably need to
> periodically trigger a major compaction if I want to avoid that. However,
> I've read that major compactions aren't really recommended. I'd like to get
> people's take on this. I'd also be interested in people's recommendations
> on compaction strategy and other compaction-related configuration settings.
>

This is getting to be a FAQ, but briefly:

1) Yes, you are correct about the amount of wasted space; this is why most
people avoid write patterns with heavy overwrite.
2) The docs used to say something incoherent about major compactions, but
suffice it to say that running them regularly is often a viable solution.
They are the most thorough means Cassandra has of merging data (example
command after this list).
3) If you do end up with a problem related to the One Huge SSTable that a
major compaction leaves behind, you can always use sstablesplit to split it
into N smaller ones (see below).
4) If you really don't want to run a major compaction, you can either use
Leveled compaction (which has its own caveats) or use checksstablegarbage [1]
and UserDefinedCompaction to manually compact selected SSTables (sketches
below).
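For (2), a major compaction is just nodetool compact run against the keyspace
and column family; the names below are hypothetical, and you'd typically run
it from cron in a low-traffic window:

    # merge all SSTables for one column family into a single SSTable
    nodetool compact my_keyspace my_cf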
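For (3), sstablesplit operates offline on the data files, so stop the node
first; the path and target size here are examples only (check
sstablesplit --help for the options in your version):

    # split one large SSTable into ~50 MB pieces (node must be down)
    sstablesplit --size 50 /var/lib/cassandra/data/my_keyspace/my_cf/my_keyspace-my_cf-jb-1-Data.db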
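For (4), switching to Leveled compaction is a schema change from cqlsh, and
UserDefinedCompaction is a JMX operation on the CompactionManager MBean
(jconsole or jmxterm both work). Table name and sstable_size_in_mb below are
placeholders, and the exact JMX signature varies between Cassandra versions:

    -- in cqlsh
    ALTER TABLE my_keyspace.my_cf
      WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

    # via JMX, on org.apache.cassandra.db:type=CompactionManager, pass the
    # -Data.db files that checksstablegarbage reports as holding mostly
    # droppable data, e.g.
    # forceUserDefinedCompaction("my_keyspace-my_cf-jb-1-Data.db,my_keyspace-my_cf-jb-5-Data.db")
    # (older versions take the keyspace name as a separate first argument)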

=Rob
[1] https://github.com/cloudian/support-tools#checksstablegarbage
