> Meaning terabyte-sized databases. 
> 
Lots of people have TB-sized systems. Just add more nodes. 
300 to 400 GB is just a rough guideline. The bigger picture is considering how 
routine and non-routine maintenance tasks are going to be carried out. 

Cheers
  
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
> 
> 
> On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L 
> <wade.l.poziom...@intel.com> wrote:
> “Having so much data on each node is a potential bad day.”
> 
>  
> 
> Is this discussed somewhere in the Cassandra documentation (limits, practices, 
> etc.)?  We are also trying to load up quite a lot of data and have hit memory 
> issues (bloom filters etc.) in 1.0.10.  I would like to read up on big data 
> usage of Cassandra. Meaning terabyte-sized databases. 
> 
>  
> 
> I do get your point about the amount of time required to recover a downed node. 
> But this 300-400GB business is interesting to me.
> 
>  
> 
> Thanks in advance.
> 
>  
> 
> Wade
> 
>  
> 
> From: aaron morton [mailto:aa...@thelastpickle.com] 
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered 
> compaction.
> 
>  
> 
> Basically we were successful on two of the nodes. They both took ~2 days and 
> 11 hours to complete and at the end we saw one very large file ~900GB and the 
> rest much smaller (the overall size decreased). This is what we expected!
> 
> I would recommend having up to 300GB to 400GB per node on a regular HDD with 
> 1Gb networking. 
> 
>  
> 
> But on the 3rd node, we suspect major compaction didn't actually finish its 
> job…
> 
> The file list looks odd. Check the timestamps on the files. You should not 
> have files older than when the compaction started. 
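> 
> For example, something like the following (using the data path from your 
> listing further down) shows the data files newest first; anything with a 
> modification time from before the compaction started is suspect:
> 
> ls -lt /data_bst/cassandra/data/ATLAS/Data/*-Data.db | head -20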
> 
>  
> 
> 8GB heap 
> 
> The default maximum is 4GB these days. 
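> 
> If you want to pin the heap explicitly rather than let cassandra-env.sh 
> calculate it from the machine's RAM, the usual knobs are in 
> conf/cassandra-env.sh (the values here are only an example, not a 
> recommendation for your boxes):
> 
> MAX_HEAP_SIZE="4G"
> HEAP_NEWSIZE="400M"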
> 
>  
> 
> 1) Do you expect problems with the 3rd node during 2 more weeks of 
> operations, in the conditions seen below? 
> 
> I cannot answer that. 
> 
>  
> 
> 2) Should we restart with leveled compaction next year? 
> 
> I would run some tests to see how it works for your workload. 
> 
>  
> 
> 4) Should we consider increasing the cluster capacity?
> 
> IMHO yes.
> 
> You may also want to experiment with turning compression on if it is not 
> already enabled. 
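> 
> For example, from cassandra-cli (the column family name here is yours; the 
> chunk size is just a common starting point, not a tuned value):
> 
> update column family Data with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
> 
> Note that existing SSTables only pick up compression as they are rewritten; 
> if I recall correctly, nodetool upgradesstables (or scrub) will rewrite them.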
> 
>  
> 
> Having so much data on each node is a potential bad day. If instead you had 
> to move or repair one of those nodes, how long would it take for Cassandra to 
> stream all the data over? (Or to rsync the data over.) How long does it take 
> to run nodetool repair on the node?
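> 
> As a rough back-of-the-envelope: streaming 1.1TB over gigabit ethernet at an 
> optimistic ~100 MB/s sustained is 1,100,000 MB / 100 MB/s = 11,000 seconds, 
> or about 3 hours, and in practice repair and bootstrap usually run well below 
> the raw link speed.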
> 
>  
> 
> With RF 2, if you lose a node you have lost your redundancy. It's important 
> to have a plan for how to get it back and how long it may take.   
> 
>  
> 
> Hope that helps. 
> 
>  
> 
> -----------------
> 
> Aaron Morton
> 
> Freelance Cassandra Developer
> 
> New Zealand
> 
>  
> 
> @aaronmorton
> 
> http://www.thelastpickle.com
> 
>  
> 
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe <adsi...@gmail.com> wrote:
> 
> 
> 
> 
> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all 3 
> nodes before replying with my findings.
> 
> Basically we were successful on two of the nodes. They both took ~2 days and 
> 11 hours to complete and at the end we saw one very large file ~900GB and the 
> rest much smaller (the overall size decreased). This is what we expected!
> 
> But on the 3rd node, we suspect major compaction didn't actually finish its 
> job. First of all, nodetool compact returned much earlier than on the other 
> nodes - after one day and 15 hrs. Secondly, of the 1.4TB initially on the node 
> only about 36GB were freed up (the node is almost the same size as before). 
> Saw nothing in the server log (debug not enabled). Below I pasted some more 
> details about file sizes before and after compaction on this third node, and 
> disk occupancy.
> 
> The situation is maybe not so dramatic for us because in less than 2 weeks we 
> will have downtime until after the new year. During this time we can 
> completely delete all the data in the cluster and start fresh with TTLs of 1 
> month (as suggested by Aaron, and an 8GB heap as suggested by Alain - thanks).
> 
> Questions:
> 
> 1) Do you expect problems with the 3rd node during 2 more weeks of 
> operations, in the conditions seen below? 
> [Note: we expect the minor compactions to continue building up files but 
> never really getting around to compacting the large file, and thus not 
> needing much temporary extra disk space].
> 
> 2) Should we restart with leveled compaction next year? 
> [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, 
> which means older rows end up in big files => to free up space with 
> SizeTiered we will have no choice but to run major compactions, which we 
> don't know will keep working given that we accumulate ~1TB / node / month. 
> You can see we are at the limit!]
> 
> 3) In case we keep SizeTiered:
> 
>     - How can we improve the performance of our major compactions? (We left 
> all config parameters at their defaults; see the note on the relevant knobs 
> below.) Would increasing compaction throughput interfere with writes and 
> reads? What about multi-threaded compaction?
> 
>     - Do we still need to run regular repair operations as well? Do these 
> also do a major compaction or are they completely separate operations? 
> 
> [Note: we have 3 nodes with RF=2 and inserting at consistency level 1 and 
> reading at consistency level ALL. We read primarily for exporting reasons - 
> we export 1 week worth of data at a time].
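> 
> [On the compaction performance question: the relevant knobs appear to be 
> nodetool -h $HOSTNAME setcompactionthroughput <MB/s> at runtime, and 
> compaction_throughput_mb_per_sec (default 16) plus multithreaded_compaction 
> (default false) in cassandra.yaml - all currently left at their defaults 
> here.]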
> 
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to 
> the hundreds of millions of rows per node mentioned by Aaron as the volumes 
> that would create problems with bloom filters and indexes].
> 
> Cheers,
> Alex
> ------------------
> 
> The situation in the data folder 
> 
>     before calling nodetool compact:
> 
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T    total
> 
>     after nodetool compact returned:
> 
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M  
> 
> 
> Looking at the disk occupancy for the logical partition the data folder 
> is in:
> 
> df /data_bst
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst
> 
> 
> and the situation in the cluster
> 
> nodetool -h $HOSTNAME ring (before major compaction)
> Address         DC          Rack        Status State   Load     Effective-Ownership  Token
>                                                                                      113427455640312821154458202477256070484
> 10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB  66.67%               0
> 10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB  66.67%               113427455640312821154458202477256070484
> 
> nodetool -h $HOSTNAME ring (after major compaction; note we were inserting 
> data in the meantime)
> Address         DC          Rack        Status State   Load     Effective-Ownership  Token
>                                                                                      113427455640312821154458202477256070484
> 10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB  66.67%               0
> 10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB  66.67%               113427455640312821154458202477256070484
> 
> 
>  
> 
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com> wrote:
> 
> > From what I know, having too much data on one node is bad; I'm not really 
> > sure why, but I think performance will go down due to the size of indexes 
> > and bloom filters (I may be wrong on the reasons but I'm quite sure you 
> > can't store too much data per node).
> 
> If you have many hundreds of millions of rows on a node, the memory needed 
> for bloom filters and index sampling can be significant. These can both be 
> tuned.
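> 
> For example, the bloom filter false-positive chance can be raised per column 
> family from cassandra-cli (the CF name is made up and 0.1 is only an 
> illustration):
> 
> update column family Data with bloom_filter_fp_chance = 0.1;
> 
> Index sampling is controlled by index_interval in cassandra.yaml (default 
> 128); raising it trades a little read latency for less memory.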
> 
> If you have 1.1TB per node, the time to do a compaction, repair, or upgrade 
> may be very significant. Also, the time taken to copy this data should you 
> need to remove or replace a node may be prohibitive.
> 
> 
> > 2. Switch to Leveled compaction strategy.
> 
> I would avoid making a change like that on an unstable / at-risk system.
> 
> > - Our usage pattern is write once, read once (export) and delete once!
> 
> The column TTL may be of use to you; it removes the need to do a delete.
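> 
> For example, from cassandra-cli (the CF, row key and column names are made 
> up; 2592000 seconds is 30 days):
> 
> set Data['some_row']['some_col'] = 'some_value' with ttl = 2592000;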
> 
> > - We were thinking of relying on the automatic minor compactions to free up 
> > space for us but as..
> There are some usage patterns which make life harder for STS. For example, if 
> you have very long-lived rows that are written to and deleted a lot, row 
> fragments that have been around for a while will end up in bigger files, and 
> these files get compacted less often.
> 
> In this situation, if you are running low on disk space and you think there 
> is a lot of deleted data in there, I would run a major compaction. A word of 
> warning though: if you do this you will need to continue to do it regularly. 
> Major compaction creates a single big file that will not get compacted 
> often. There are ways to resolve this, and moving to LDB may help in the 
> future.
> 
> If you are stuck and worried about disk space it's what I would do. Once you 
> are stable again, look at LDB: 
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
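> 
> When you do get to that point, the switch itself is a one-liner from 
> cassandra-cli (the sstable_size_in_mb shown is only a commonly used starting 
> point, not a recommendation for your data):
> 
> update column family Data with compaction_strategy = 'LeveledCompactionStrategy' and compaction_strategy_options = {sstable_size_in_mb: 10};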
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> 
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> 
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per 
> > node for the data dir and separate disk for the commitlog, 12 cores, 24 GB 
> > RAM"
> >
> > I think you should tune your architecture in a very different way. From 
> > what I know, having too much data on one node is bad; I'm not really sure 
> > why, but I think performance will go down due to the size of indexes and 
> > bloom filters (I may be wrong on the reasons but I'm quite sure you can't 
> > store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB) 
> > would be better, if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB, because if you use more than that the GC 
> > pauses will start decreasing your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless consistency or avoiding a SPOF doesn't matter 
> > to you.
> >
> > With RF=2 you are obliged to write at CL.ONE to avoid a single point of 
> > failure (QUORUM with RF=2 requires both replicas, so one node down would 
> > block writes).
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >      - This is not recommended:
> >             - Stops minor compactions.
> >             - Major performance hit on node (very bad for us because need 
> > to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What happens 
> > is that due to the size of the sstable that remains after your major 
> > compaction, it will never be compacted with the upcoming new sstables, and 
> > because of that, your read performance will go down until you run another 
> > major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >       - It is mentioned to help with deletes and disk space usage. Can 
> > someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space. It will 
> > allow you to use a greater percentage of your total disk space (50% max 
> > for size-tiered compaction vs about 80% for leveled compaction).
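> >
> > On a 3TB disk that would work out to roughly 3TB x 50% = 1.5TB usable with 
> > size-tiered versus 3TB x 80% = 2.4TB with leveled, if those percentages 
> > hold.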
> >
> > "Our usage pattern is write once, read once (export) and delete once! "
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better solutions?"
> >
> > Are your sstables compressed? You have 2 types of built-in compression and 
> > you may use them depending on the model of each of your CFs.
> >
> > see: 
> > http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsi...@gmail.com>
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per 
> > node for the data dir and separate disk for the commitlog, 12 cores, 24 GB 
> > RAM (12GB to Cassandra heap).
> >
> 
>  
> 
>  
> 
> 
