> Meaning terabyte size databases.

Lots of people have TB sized systems. Just add more nodes.

300 to 400 GB is just a rough guideline. The bigger picture is considering how routine and non-routine maintenance tasks are going to be carried out.
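As a rough illustration of why node size drives maintenance time, a back-of-the-envelope sketch (assuming roughly 100 MB/s of effective transfer over 1GbE, which is an assumption rather than a measured figure):

    1.1 TB per node / ~100 MB/s  ≈  11,000 s  ≈  3+ hours

That is just to stream the raw data when rebuilding or moving a node, before any validation, compaction or repair work on top. At 300 to 400 GB per node the same transfer is closer to an hour.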
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>
> On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <wade.l.poziom...@intel.com> wrote:
>
> “Having so much data on each node is a potential bad day.”
>
> Is this discussed somewhere in the Cassandra documentation (limits, practices, etc.)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filters etc.) in 1.0.10. I would like to read up on big data usage of Cassandra. Meaning terabyte size databases.
>
> I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me.
>
> Thanks in advance.
>
> Wade
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> I would recommend having up to 300GB to 400GB per node on a regular HDD with 1GbE networking.
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job…
>
> The file list looks odd. Check the timestamps on the files. You should not have files older than when compaction started.
>
> 8GB heap
>
> The default is a 4GB maximum these days.
>
> 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
>
> I cannot answer that.
>
> 2) Should we restart with leveled compaction next year?
>
> I would run some tests to see how it works for your workload.
>
> 4) Should we consider increasing the cluster capacity?
>
> IMHO yes.
>
> You may also want to do some experiments with turning compression on, if it is not already enabled.
>
> Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?
>
> With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and how long it may take.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
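On the compression suggestion: in 1.1 compression is enabled per column family. A minimal sketch of what that looks like from cassandra-cli, using the ATLAS keyspace and Data column family implied by the file listings further down (the chunk length is an illustrative value, not a recommendation; see the DataStax tuning docs for how to choose it):

    use ATLAS;
    update column family Data with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

Existing SSTables are only rewritten in the compressed format as they get compacted (or if you rebuild them explicitly, e.g. with nodetool upgradesstables), so any space saving shows up gradually rather than immediately.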
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe <adsi...@gmail.com> wrote:
>
> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node, and disk occupancy.
>
> The situation is maybe not so dramatic for us because in less than 2 weeks we will have a downtime until after the new year. During this we can completely delete all the data in the cluster and start fresh with 1-month TTLs (as suggested by Aaron) and an 8GB heap (as suggested by Alain - thanks).
>
> Questions:
>
> 1) Do you expect problems with the 3rd node during 2 more weeks of operations, in the conditions seen below?
> [Note: we expect the minor compactions to continue building up files but never really get to compacting the large file, and thus not need much temporary extra disk space.]
>
> 2) Should we restart with leveled compaction next year?
> [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, which means older rows end up in big files => to free up space with SizeTiered we will have no choice but to run major compactions, which we don't know will work given that we get ~1TB / node / month. You can see we are at the limit!]
>
> 3) In case we keep SizeTiered:
>     - How can we improve the performance of our major compactions? (We left all config parameters at default.) Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compactions? (See the note after these questions.)
>     - Do we still need to run regular repair operations as well? Do these also do a major compaction, or are they completely separate operations?
> [Note: we have 3 nodes with RF=2, inserting at consistency level ONE and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time.]
>
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to the hundreds of millions of rows per node mentioned by Aaron as the volumes that would create problems with bloom filters and indexes.]
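On question 3, a rough sketch of the knobs involved (the numbers are illustrative, not recommendations): compaction speed is capped by compaction_throughput_mb_per_sec in cassandra.yaml (16 MB/s by default) and can be changed on a running node, e.g.

    nodetool -h $HOSTNAME setcompactionthroughput 32

Raising it lets compactions finish sooner at the price of more disk I/O competing with reads and writes, so it is worth watching read/write latencies while experimenting. There is also a multithreaded_compaction option in cassandra.yaml, off by default. Repair is a separate anti-entropy operation (validation plus streaming between replicas), not a major compaction, and still needs to be run periodically to keep replicas consistent, e.g.

    nodetool -h $HOSTNAME repair -pr ATLAS

where ATLAS is the keyspace name taken from the data paths below and -pr limits the repair to the node's primary range.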
> Cheers,
> Alex
>
> ------------------
>
> The situation in the data folder before calling nodetool compact:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T  total
>
> After nodetool compact returned:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M
>
> Looking at the disk occupancy for the logical partition the data folder is in:
>
> df /data_bst
> Filesystem     1K-blocks        Used   Available  Use%  Mounted on
> /dev/sdb1     2927242720  1482502260  1444740460   51%  /data_bst
>
> And the situation in the cluster:
>
> nodetool -h $HOSTNAME ring (before major compaction)
> Address       DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                 113427455640312821154458202477256070484
> 10.146.44.17  datacenter1  rack1  Up      Normal  1.37 TB  66.67%               0
> 10.146.44.18  datacenter1  rack1  Up      Normal  1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1  rack1  Up      Normal  1.14 TB  66.67%               113427455640312821154458202477256070484
>
> nodetool -h $HOSTNAME ring (after major compaction; note we were inserting data in the meantime)
> Address       DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                 113427455640312821154458202477256070484
> 10.146.44.17  datacenter1  rack1  Up      Normal  1.38 TB  66.67%               0
> 10.146.44.18  datacenter1  rack1  Up      Normal  1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1  rack1  Up      Normal  1.19 TB  66.67%               113427455640312821154458202477256070484
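A back-of-the-envelope observation on the numbers above (a possible explanation to check, not a diagnosis): a size-tiered major compaction writes one new file containing everything before the old files are removed, so in the worst case it needs roughly as much free space as the data it is compacting, minus whatever gets purged. Here that is

    ~1.4 TB of SSTables to rewrite  vs  ~1.44 TB available on /data_bst (51% used)

which leaves almost no headroom if little data is actually purged. It would be worth confirming whether the compaction on this node skipped or aborted for lack of disk space.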
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
> > From what I know having too much data on one node is bad, not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> If you have many hundreds of millions of rows on a node, the memory needed for bloom filters and index sampling can be significant. These can both be tuned.
>
> If you have 1.1T per node, the time to do a compaction, repair or upgrade may be very significant. Also the time taken to copy this data, should you need to remove or replace a node, may be prohibitive.
>
> > 2. Switch to Leveled compaction strategy.
>
> I would avoid making a change like that on an unstable / at-risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
> The column TTL may be of use to you, it removes the need to do a delete.
>
> > - We were thinking of relying on the automatic minor compactions to free up space for us but as..
>
> There are some usage patterns which make life harder for STS. For example, if you have very long lived rows that are written to and deleted a lot, row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word of warning though: if you do this you will need to continue to do it regularly. Major compaction creates a single big file that will not get compacted often. There are ways to resolve this, and moving to LDB may help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
> >
> > I think you should tune your architecture in a very different way. From what I know having too much data on one node is bad, not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB) would be better, if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB, because if you use more than these 8GB the GC jobs will start decreasing your performance.
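For reference, a minimal sketch of pinning the heap at 8GB on a 1.1 node (set in conf/cassandra-env.sh; the two variables are meant to be overridden together, and the new-gen size shown just follows the commonly cited rule of thumb of roughly 100MB per physical core for a 12-core box, not a tuned value):

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="1200M"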
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3, unless either consistency or avoiding a SPOF doesn't matter to you.
> >
> > With RF=2 you are obliged to write at CL.one to remove the single point of failure.
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >     - This is not recommended:
> >         - Stops minor compactions.
> >         - Major performance hit on node (very bad for us because we need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What happens is that, due to the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that your read performance will go down until you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >     - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
> >
> > From what I know, leveled compaction will not free disk space. It will allow you to use a greater percentage of your total disk space (50% max for size-tiered compaction vs about 80% for leveled compaction).
> >
> > "Our usage pattern is write once, read once (export) and delete once!"
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better solutions?"
> >
> > Are your sstables compressed? There are 2 types of built-in compression and you may use them depending on the model of each of your CFs.
> >
> > See: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsi...@gmail.com>
> >
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and a separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
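Following up the leveled compaction discussion, a minimal sketch of what the switch looks like on 1.1 from cassandra-cli, again using the ATLAS/Data names from the file listings (the SSTable size is an illustrative value; the "when to use leveled compaction" post linked earlier discusses how to choose it):

    use ATLAS;
    update column family Data with compaction_strategy = 'LeveledCompactionStrategy' and compaction_strategy_options = {sstable_size_in_mb: 10};

After the switch, the existing size-tiered SSTables are gradually reorganised into levels by background compaction, which is a significant amount of extra I/O on nodes holding around a terabyte each, so it is best tried on a test node or during a quiet period first.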