Hi Nate,
Are you using incremental backups?
Extract from the documentation (http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_backup_incremental_t.html):
"When incremental backups are enabled (disabled by default), Cassandra
hard-links each flushed SSTable to a backups directory under the
keyspace data directory. This allows storing backups offsite without
transferring entire snapshots. Also, incremental backups combine with
snapshots to provide a dependable, up-to-date backup mechanism.

As with snapshots, Cassandra does not automatically clear incremental
backup files. *DataStax recommends setting up a process to clear
incremental backup hard-links each time a new snapshot is created.*"
These backups are stored in directories named "backups" at the same
level as the "snapshots" directories.
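If it helps, here is a minimal sketch of that recommended cleanup step. It is illustrative only: the directory layout below is simulated so the commands are safe to run anywhere, the keyspace/table names ("mykeyspace", "mytable") are placeholders, and on a real node you would take a fresh snapshot (nodetool snapshot <keyspace>) before deleting anything.

```shell
# Simulate a keyspace data directory (a real one would be e.g.
# /var/lib/cassandra/data); "mykeyspace"/"mytable" are placeholders.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/mykeyspace/mytable/backups" \
         "$DATA_DIR/mykeyspace/mytable/snapshots/2014-12-09"
touch "$DATA_DIR/mykeyspace/mytable/backups/mykeyspace-mytable-ka-1-Data.db"

# Clear the incremental-backup hard-links while leaving the
# snapshots directories untouched:
find "$DATA_DIR" -mindepth 3 -maxdepth 3 -type d -name backups \
    -exec sh -c 'rm -rf "${1:?}"/*' _ {} \;
```

Running something like this from cron right after each scheduled snapshot keeps the backups directories from growing without bound.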
Reynald
On 09/12/2014 18:13, Nate Yoder wrote:
Thanks for the advice. Totally makes sense. Once I figure out how to
make my data stop taking up more than 2x more space without being
useful I'll definitely make the change :)
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 9:02 AM, Jonathan Haddad <j...@jonhaddad.com
<mailto:j...@jonhaddad.com>> wrote:
Well, I personally don't like RF=2. It means if you're using
CL=QUORUM and a node goes down, you're going to have a bad time
(downtime). If you're using CL=ONE then you'd be OK. However, I am
not wild about losing a node and having only 1 copy of my data
available in prod.
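For the arithmetic behind that: Cassandra computes the quorum size as floor(RF / 2) + 1, so with RF=2 a quorum is 2 nodes — every replica — which is why one node going down blocks QUORUM operations. A quick sketch:

```shell
# Quorum size as Cassandra computes it: integer division, then +1.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 2   # prints 2 -> with RF=2, both replicas must be up
quorum 3   # prints 2 -> with RF=3, one replica may be down
```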
On Tue Dec 09 2014 at 8:40:37 AM Nate Yoder <n...@whistle.com
<mailto:n...@whistle.com>> wrote:
Thanks Jonathan. So there is nothing too idiotic about my
current set-up of 6 boxes, each with 256 vnodes, and an RF
of 2?
I appreciate the help,
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 8:31 AM, Jonathan Haddad
<j...@jonhaddad.com <mailto:j...@jonhaddad.com>> wrote:
You don't need a prime number of nodes in your ring, but
it's not a bad idea to have it be a multiple of your RF when
your cluster is small.
On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder
<n...@whistle.com <mailto:n...@whistle.com>> wrote:
Hi Ian,
Thanks for the suggestion, but I had actually already
done that prior to the scenario I described (to get
myself some free space), and when I ran nodetool
cfstats it listed 0 snapshots as expected, so
unfortunately I don't think that is where my space went.
One additional piece of information I forgot to point
out is that when I ran nodetool status on the node it
included all 6 nodes.
I have also heard it mentioned that I may want to have
a prime number of nodes which may help protect against
split-brain. Is this true? If so does it still apply
when I am using vnodes?
Thanks again,
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose
<ianr...@fullstory.com <mailto:ianr...@fullstory.com>>
wrote:
Try `nodetool clearsnapshot`, which will delete any
snapshots you have. I have never taken a snapshot
with nodetool, yet I recently found several snapshots
on my disk (which can take a lot of space). So
perhaps they are automatically generated by some
operation? No idea. Regardless, nuking those
freed up a ton of space for me.
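For reference, the snapshot-related subcommands in 2.1 look like the following (keyspace and snapshot names are placeholders, and these need a running node, so treat this as a usage sketch rather than something to paste blindly). As for where surprise snapshots come from: with auto_snapshot enabled (the default), Cassandra snapshots a table automatically before TRUNCATE or DROP.

```shell
nodetool listsnapshots                    # list snapshots and their sizes
nodetool clearsnapshot                    # delete all snapshots on this node
nodetool clearsnapshot mykeyspace         # ...or only one keyspace's
nodetool clearsnapshot -t before_upgrade  # ...or only a named snapshot
```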
- Ian
On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder
<n...@whistle.com <mailto:n...@whistle.com>> wrote:
Hi All,
I am new to Cassandra, so I apologise in
advance if I have missed anything obvious, but
this one currently has me stumped.
I am currently running a 6 node Cassandra
2.1.1 cluster on EC2 using C3.2XLarge nodes
which overall is working very well for us.
However, after letting it run for a while I
seem to get into a situation where the amount
of disk space used far exceeds the total
amount of data on each node and I haven't been
able to get the size to go back down except by
stopping and restarting the node.
For example, almost all of my data is in one
table. On one of my nodes right now the total
space used (as reported by nodetool cfstats) is
57.2 GB and there are no snapshots. However, when
I look at the size of the data files (using du),
the data file for that table is 107 GB. Because
the C3.2XLarge instances only have 160 GB of SSD,
you can see why this quickly becomes a problem.
Running nodetool compact didn't reduce the size,
and neither did running nodetool repair -pr on
the node. I also tried nodetool flush and
nodetool cleanup (even though I have not added
or removed any nodes recently), but neither
changed anything. In order to keep my cluster up
I then stopped and started that node, and the
size of the data file dropped to 54 GB while the
total column family size (as reported by
nodetool) stayed about the same.
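For concreteness, the comparison that exposes the discrepancy boils down to the following (keyspace name and data path are placeholders, and the last check assumes lsof is installed):

```shell
# What Cassandra believes is live vs. what the filesystem reports;
# a large gap usually means obsolete (already-compacted) SSTables
# that have not yet been removed from disk:
nodetool cfstats mykeyspace | grep -i 'space used'
du -sh /var/lib/cassandra/data/mykeyspace/*

# Files that were unlinked but are still held open by the JVM also
# eat space (they show up in df but not in du):
lsof +L1 | grep -i java
```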
Any suggestions as to what I could be doing wrong?
Thanks,
Nate