Hi Nate,
Are you using incremental backups?
Extract from the documentation (http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_backup_incremental_t.html):
"When incremental backups are enabled (disabled by default), Cassandra
hard-links each flushed SSTable to a backups directory under the
keyspace data directory. This allows storing backups offsite without
transferring entire snapshots. Also, incremental backups combine with
snapshots to provide a dependable, up-to-date backup mechanism.

As with snapshots, Cassandra does not automatically clear incremental
backup files. *DataStax recommends setting up a process to clear
incremental backup hard-links each time a new snapshot is created.*"
These backups are stored in directories named "backups" at the same
level as the "snapshots" directories.
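If it helps, here is a minimal sketch of that recommended cleanup step. It is illustrative only: the directory layout below is simulated so the commands are safe to run anywhere, the keyspace/table names ("mykeyspace", "mytable") are placeholders, and on a real node you would take a fresh snapshot (nodetool snapshot <keyspace>) before deleting anything.

```shell
# Simulate a keyspace data directory (a real one would be e.g.
# /var/lib/cassandra/data); "mykeyspace"/"mytable" are placeholders.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/mykeyspace/mytable/backups" \
         "$DATA_DIR/mykeyspace/mytable/snapshots/2014-12-09"
touch "$DATA_DIR/mykeyspace/mytable/backups/mykeyspace-mytable-ka-1-Data.db"

# Clear the incremental-backup hard-links while leaving the
# snapshots directories untouched:
find "$DATA_DIR" -mindepth 3 -maxdepth 3 -type d -name backups \
    -exec sh -c 'rm -rf "${1:?}"/*' _ {} \;
```

Running something like this from cron right after each scheduled snapshot keeps the backups directories from growing without bound.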
Reynald
On 09/12/2014 18:13, Nate Yoder wrote:
Thanks for the advice. Totally makes sense. Once I figure out how to
make my data stop taking up more than 2x more space without being
useful I'll definitely make the change :)
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 9:02 AM, Jonathan Haddad <j...@jonhaddad.com
<mailto:j...@jonhaddad.com>> wrote:
Well, I personally don't like RF=2. It means if you're using
CL=QUORUM and a node goes down, you're going to have a bad time
(downtime). If you're using CL=ONE then you'd be OK. However, I am
not wild about losing a node and having only 1 copy of my data
available in prod.
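For the arithmetic behind that: Cassandra computes the quorum size as floor(RF / 2) + 1, so with RF=2 a quorum is 2 nodes — every replica — which is why one node going down blocks QUORUM operations. A quick sketch:

```shell
# Quorum size as Cassandra computes it: integer division, then +1.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 2   # prints 2 -> with RF=2, both replicas must be up
quorum 3   # prints 2 -> with RF=3, one replica may be down
```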
On Tue Dec 09 2014 at 8:40:37 AM Nate Yoder <n...@whistle.com
<mailto:n...@whistle.com>> wrote:
Thanks Jonathan. So there is nothing too idiotic about my
current set-up of 6 boxes, each with 256 vnodes, and an RF
of 2?
I appreciate the help,
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 8:31 AM, Jonathan Haddad
<j...@jonhaddad.com <mailto:j...@jonhaddad.com>> wrote:
You don't need a prime number of nodes in your ring, but
it's not a bad idea to have it be a multiple of your RF when
your cluster is small.
On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder
<n...@whistle.com <mailto:n...@whistle.com>> wrote:
Hi Ian,
Thanks for the suggestion, but I had actually already
done that prior to the scenario I described (to get
myself some free space), and when I ran nodetool
cfstats it listed 0 snapshots as expected, so
unfortunately I don't think that is where my space went.
One additional piece of information I forgot to point
out is that when I ran nodetool status on the node it
included all 6 nodes.
I have also heard it mentioned that I may want to have
a prime number of nodes which may help protect against
split-brain. Is this true? If so does it still apply
when I am using vnodes?
Thanks again,
Nate
--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com <mailto:n...@whistle.com>
On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose
<ianr...@fullstory.com <mailto:ianr...@fullstory.com>>
wrote:
Try `nodetool clearsnapshot`, which will delete any
snapshots you have. I have never taken a snapshot
with nodetool, yet I recently found several snapshots
on my disk (which can take a lot of space). So
perhaps they are automatically generated by some
operation? No idea. Regardless, nuking those
freed up a ton of space for me.
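For reference, the snapshot-related subcommands in 2.1 look like the following (keyspace and snapshot names are placeholders, and these need a running node, so treat this as a usage sketch rather than something to paste blindly). As for where surprise snapshots come from: with auto_snapshot enabled (the default), Cassandra snapshots a table automatically before TRUNCATE or DROP.

```shell
nodetool listsnapshots                    # list snapshots and their sizes
nodetool clearsnapshot                    # delete all snapshots on this node
nodetool clearsnapshot mykeyspace         # ...or only one keyspace's
nodetool clearsnapshot -t before_upgrade  # ...or only a named snapshot
```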
- Ian
On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder
<n...@whistle.com <mailto:n...@whistle.com>> wrote:
Hi All,
I am new to Cassandra, so I apologise in
advance if I have missed anything obvious, but
this one currently has me stumped.
I am currently running a 6 node Cassandra
2.1.1 cluster on EC2 using C3.2XLarge nodes
which overall is working very well for us.
However, after letting it run for a while I
seem to get into a situation where the amount
of disk space used far exceeds the total
amount of data on each node and I haven't been
able to get the size to go back down except by
stopping and restarting the node.
For example, almost all of my data is in one
table. On one of my nodes right now the total
space used (as reported by nodetool cfstats) is
57.2 GB and there are no snapshots. However, when
I look at the size of the data files (using du),
the data file for that table is 107 GB. Because
the C3.2XLarge instances only have 160 GB of SSD,
you can see why this quickly becomes a problem.
Running nodetool compact didn't reduce the size,
and neither did running nodetool repair -pr on
the node. I also tried nodetool flush and
nodetool cleanup (even though I have not added
or removed any nodes recently), but neither
changed anything. In order to keep my cluster up
I then stopped and started that node, and the
size of the data file dropped to 54 GB while the
total column family size (as reported by
nodetool) stayed about the same.
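For concreteness, the comparison that exposes the discrepancy boils down to the following (keyspace name and data path are placeholders, and the last check assumes lsof is installed):

```shell
# What Cassandra believes is live vs. what the filesystem reports;
# a large gap usually means obsolete (already-compacted) SSTables
# that have not yet been removed from disk:
nodetool cfstats mykeyspace | grep -i 'space used'
du -sh /var/lib/cassandra/data/mykeyspace/*

# Files that were unlinked but are still held open by the JVM also
# eat space (they show up in df but not in du):
lsof +L1 | grep -i java
```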
Any suggestions as to what I could be doing wrong?
Thanks,
Nate