Re: Cassandra disk usage and failure recovery

2011-01-04 Thread Peter Schuller
> That is correct.  In 0.6, an anticompaction was performed and a temporary
> SSTable was written out to disk, then streamed to the recipient.  The way
> this is now done in 0.7 requires no extra disk space on the source node.

Great. So that should at least mean that running out of disk space
should always be solvable, in terms of the cluster, by adding new
nodes in between other pre-existing nodes. That is, provided that the
internal node issues (allowing compaction to take place so that space
can actually be freed) are solved in some way. And assuming the
compaction redesign happens at some point, that should be a minor
issue, because it should be easy to avoid ever getting to the point
where a single (size-limited) sstable no longer fits.
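
Purely to illustrate the arithmetic (the tokens below are made up, and
this is not code from Cassandra itself): a new node bootstrapped at
the midpoint of an existing range takes over roughly half of that
range.

    # Hypothetical example: pick an initial_token for a new node so that
    # it splits an existing RandomPartitioner range (token space is
    # 0..2**127) roughly in half.
    left = 0                      # token of the previous node on the ring
    right = 2**127 // 3           # token of the node we want to relieve
    new_token = left + (right - left) // 2
    print(new_token)              # use as the new node's initial_token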

-- 
/ Peter Schuller


Re: Cassandra disk usage and failure recovery

2011-01-04 Thread Tyler Hobbs
> Anti-compaction and streaming is done to move data from nodes that
> have it (that are in the replica set). This implies CPU and I/O and
> networking load on the source node, so it does have an impact. See
> http://wiki.apache.org/cassandra/Streaming among others.
>
> (Here's where I'm not sure, someone please confirm/deny) In 0.6, I
> believe this required diskspace on the originating node. In 0.7, I
> *think* the need for disk space on the source node is removed.
>

That is correct.  In 0.6, an anticompaction was performed and a temporary
SSTable was written out to disk, then streamed to the recipient.  The way
this is now done in 0.7 requires no extra disk space on the source node.

- Tyler


Re: Cassandra disk usage and failure recovery

2011-01-04 Thread Peter Schuller
This will be a very selective response, not nearly as exhaustive as it
should be to truly cover what you bring up. Sorry, but here are some
random tidbits.

> On the cassandra user list, I noticed a thread on a user that literally
> wrote his cluster to death.  Correct me if I'm wrong, but based on that
> thread, it seemed like if one node runs out of space, it kills the
> cluster.  I hope that is not the case.

A single node running out of disk doesn't kill a cluster as such.
However, if you have a cluster of nodes where each is roughly the same
size (and assuming even balancing of data) and you are so close to
being full everywhere that one of them actually becomes full, you are
in dangerous territory if you're worried about uptime.

I'd say there are two main types of issues here:

(1) Monitoring and predicting disk space usage with Cassandra is a bit
more difficult than in a traditional system, because usage varies over
time and because writes can affect disk space for some time beyond the
write itself. I.e., delayed effects. (See the sketch after this list.)

(2) The fact that operational tasks like moving nodes and repair
operations themselves need disk space means it can be more difficult
to get out of a bad situation once you are in it.
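
As a rough sketch of the kind of monitoring I mean (this is not
Cassandra code; the data directory path is just an example), you might
warn when free space on the data volume drops below the size of the
live sstables, since compactions and repairs can temporarily need
roughly that much extra room:

    import os

    data_dir = "/var/lib/cassandra/data"    # example path
    live = sum(os.path.getsize(os.path.join(root, f))
               for root, _, files in os.walk(data_dir) for f in files)
    st = os.statvfs(data_dir)
    free = st.f_bavail * st.f_frsize
    if free < live:
        print("WARNING: free space %d < live data %d" % (free, live))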

> If I have a cluster of 3 computers that is filling up in disk space and
> I add a bunch of machines, how does cassandra deal with that situation?
> Does cassandra know not to keep writing to the initial three machines
> because the disk is near full and write to the new machines only?  At
> some point my machines are going to be full of data and I don't want to
> write any more data to those machines until I add more storage to those
> machines.  Is this possible to do?

Cassandra selects replica placement based on its replication strategy
and the ring layout (nodes and their tokens). There is no "fallback"
to putting data elsewhere just because nodes are full (doing so would
likely introduce tremendous complexity, I think).
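
To illustrate why: placement follows from tokens alone. A simplified
sketch (this is not the actual replication strategy code, and the
tokens are made up):

    import bisect

    # (token, node) pairs sorted by token.
    ring = [(0, "nodeA"), (2**125, "nodeB"), (2**126, "nodeC")]

    def replicas(row_token, rf=2):
        # Replicas are the first rf nodes found walking the ring clockwise
        # from the row's token -- how full a node is never enters into it.
        tokens = [t for t, _ in ring]
        i = bisect.bisect_left(tokens, row_token) % len(ring)
        return [ring[(i + k) % len(ring)][1] for k in range(rf)]

    print(replicas(2**124))       # -> ['nodeB', 'nodeC']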

If a node goes catastrophically out of disk, it will probably stop
working and effectively be offline in the cluster. Not necessarily,
though; it could be that e.g. compactions are not completing while
writes and reads still work, with reads becoming less performant over
time due to the lack of compaction. Alternatively, if the disks are
completely full and memtables cannot be flushed, I believe you'd
expect the node to essentially go down.

I would say that running out of disk space is not something which is
handled very gracefully right now, although there is some code to
mitigate it. For example, during compaction there is some logic to
check for disk space availability and to try compacting smaller files
first, to possibly make room for larger compactions, but that does not
address the overall issue.
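
Roughly, the idea is something like this (an illustration of the
heuristic only, not the actual compaction code):

    def trim_to_fit(sstable_sizes, free_space):
        # If the whole candidate set won't fit in the available space,
        # drop the largest sstables and compact the smaller ones first.
        chosen = sorted(sstable_sizes)
        while chosen and sum(chosen) > free_space:
            chosen.pop()          # discard the largest remaining file
        return chosen

    print(trim_to_fit([10, 40, 200, 300], free_space=120))   # -> [10, 40]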

>  I read somewhere that it is a bad
> idea to use more than 50% of the drive utilization because of the
> compaction requirements.

For a single column family, a major compaction (i.e., one in which all
of the data is involved) can currently double disk space usage if data
is not overwritten or removed. In addition, things like nodetool
repair and nodetool cleanup need disk space too.

Here's where I'm not really up on all the details. It would be nice to
arrive at a figure for the absolute worst-case disk space expansion
that is possible.
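
Back-of-the-envelope only, with made-up numbers, but this is the sort
of calculation I mean:

    # If a major compaction can temporarily double the space used by a
    # column family, size the disk so that doubling the largest CF still
    # fits alongside everything else.
    largest_cf_gb = 100     # hypothetical
    other_cfs_gb = 50       # hypothetical
    disk_gb = 300
    worst_case_gb = 2 * largest_cf_gb + other_cfs_gb
    print("headroom: %d GB" % (disk_gb - worst_case_gb))
    # repair and cleanup eat into this headroom as well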

> I was planning on using ssd for the sstables
> and standard hard drives for the compaction and writelogs.  Is this
> possible to do?

It doesn't make sense for compaction, because compaction involves
replacing sstables with new ones. You could write new sstables to
separate storage and then copy them into the sstable storage location,
but that doesn't really do anything except increase the total amount
of I/O that you're doing.

>I didn't see any options for specifying where the write
> logs are written and compaction is done.

There is a commitlog_directory option (in cassandra.yaml in 0.7).
Alternatively, the directory might be a symlink to a separate volume.

Compaction does not happen in a separate location (see above).

>Also, is it possible to add
> more drives to a running machine and have cassandra utilize a mounted
> directory with free space?

Sort of, but not really in the way you want it. The short version is,
"pretend the answer is no". The real answer is somewhere between yes
and no (if someone feels like doing a write-up).

I would suggest volume growth and a file system resize if that is
something you plan on doing, assuming you have a setup where such
operations are sufficiently trusted; probably RAID + LVM. But beware
of LVM and its implications for correctness (e.g., write barriers).
Err.. basically, no, I don't recommend planning to fiddle with that.
The potential for problems is probably too high. KISS.

> Cassandra should know to stop writing to the node once the directory
> with the sstables is near full, but it doesn't seem like it does
> anything like that.

It's kind of non-trivial to gracefully handle it in a way that both
satisfies the "I don't want to be stuck" requirement while also
satisfying the "I don'