> 2. ... So I am going to use a rotational disk for the commit log and an SSD
> for data. Does this make sense?

 

Yes, but keep in mind that the primary advantage of SSDs is lower seek
time, which translates into faster random access. We have a similar
Cassandra use case (time series data and comparable volumes) and decided
the random read performance boost (unquantified in our case, to be fair)
was not worth the price, so we went with more, larger, cheaper 7.2k
HDDs.

 

> 3. What's the best way to find out how big my commitlog disk and my data
> disk have to be? The Cassandra hardware page says the commitlog disk
> shouldn't be big, but I still need to choose a size!

 

As of Cassandra 1.0, the commit log has an explicit size bound (defaulting
to 4GB, I believe). In 0.8, I don't think I have ever seen my commit log grow
beyond that point, but the limit should be the amount of data you insert
within the maximum CF timed flush period (the "memtable_flush_after"
parameter; to be safe, take the maximum across all CFs). Any modern drive
should be sufficient. As for the size of your data disks, that is largely
application dependent, and you should be able to judge best based on your
current cluster.
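
As a rough illustration of that bound, here is a minimal sketch in Python.
The function name is hypothetical, and the write rate and flush periods are
made-up numbers rather than measurements; substitute values observed on your
own cluster.

```python
def commit_log_bound_bytes(write_rate_bytes_per_sec, flush_periods_mins):
    """Rough upper bound on commit log size.

    Assumes the log can only hold what was written during the longest
    timed flush period across all column families.
    """
    if not flush_periods_mins:
        raise ValueError("need at least one flush period")
    longest_flush_secs = max(flush_periods_mins) * 60
    return write_rate_bytes_per_sec * longest_flush_secs


# Assumed inputs: 1 MiB/s sustained writes, three CFs with timed flush
# periods of 30, 60 and 60 minutes.
assumed_rate = 1024 * 1024
assumed_periods_mins = [30, 60, 60]

bound = commit_log_bound_bytes(assumed_rate, assumed_periods_mins)
print(f"commit log bound: {bound / 1024**3:.2f} GiB")
```

With these assumed inputs the bound stays below the 4GB default mentioned
above, which is why any modern drive should have room to spare.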

 

> 4. I also noticed RAID 0 configuration is recommended for the data file
> directory. Can anyone explain why?

 

In comparison to RAID1/RAID1+0? For any RF > 1, Cassandra already takes care
of redundancy by replicating the data across multiple nodes. Your
application's choice of replication factor and read/write consistency levels
should be specified to tolerate a node failing (for any reason: disk
failure, network failure, a disgruntled employee taking a sledgehammer to
the box, etc). As such, what is the point of wasting your disks duplicating
data on a single machine to minimize the chances of one particular type of
failure when it should not matter anyway?
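
To make the replication factor and consistency level trade-off concrete,
here is a minimal sketch in Python. The helper names are hypothetical (this
is not a Cassandra API); the only rule it relies on is that QUORUM requires
floor(RF / 2) + 1 replicas to respond.

```python
def replicas_required(rf, consistency):
    """Number of replicas that must respond for one request."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return rf // 2 + 1
    if consistency == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {consistency}")


def tolerated_replica_failures(rf, consistency):
    """How many replicas of a given row may be down while requests
    at this consistency level can still succeed."""
    return rf - replicas_required(rf, consistency)


for rf in (1, 2, 3):
    for level in ("ONE", "QUORUM", "ALL"):
        down = tolerated_replica_failures(rf, level)
        print(f"RF={rf} {level}: tolerates {down} replica(s) down")
```

For example, RF=3 with QUORUM reads and writes keeps working with one
replica down, whatever took that node out, so mirroring disks inside each
node buys little extra.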

 

Dan

 

From: Alexandru Sicoe [mailto:adsi...@gmail.com] 
Sent: October-25-11 8:23
To: user@cassandra.apache.org
Subject: Cassandra cluster HW spec (commit log directory vs data file
directory)

 

Hi everyone,

I am currently in the process of writing a hardware proposal for a Cassandra
cluster for storing a lot of monitoring time series data. My workload is
write intensive and my data set is extremely varied in types of variables
and insertion rate for these variables (I will have to handle on the order of 2
million variables coming in, each at very different rates - the majority of
them will come at very low rates but there are many that will come at higher
constant rates and a few coming in with huge spikes in rates). These
variables correspond to all basic C++ types and arrays of these types. The
highest insertion rates are received for basic types, out of which U32
variables seem to be the most prevalent (e.g. I recorded 2 million U32 vars
inserted in 8 mins of operation, while 600,000 doubles and 170,000
strings were inserted during the same time. Note this measurement was only
for a subset of the total data currently taken in).
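
For reference, the measurement above can be turned into average insertion
rates with a quick sketch in Python. It uses only the counts and the 8 minute
window quoted above; the variable names are illustrative, and averages of
course hide the spikes.

```python
WINDOW_SECS = 8 * 60  # the 8 minute measurement window

# Inserted values per type during the window (subset of the total data).
counts = {"U32": 2_000_000, "double": 600_000, "string": 170_000}


def per_second(count, window_secs=WINDOW_SECS):
    """Average inserts per second over the measurement window."""
    return count / window_secs


rates = {name: per_second(n) for name, n in counts.items()}
total_rate = sum(rates.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} inserts/s")
print(f"total: {total_rate:.0f} inserts/s")
```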

At the moment I am partitioning the data in Cassandra into 75 CFs (each CF
corresponds to a logical partitioning of the set of variables mentioned
before - but this partitioning is not related to the amount of data or
rates...it is somewhat random). These 75 CFs account for ~1 million of the
variables I need to store. I have a 3 node Cassandra 0.8.5 cluster (each
node has 4 real cores and 4 GB RAM, with the commit log directory and data
file directory split between two RAID arrays with HDDs). I can handle the load in
this configuration but the average CPU usage of the Cassandra nodes is
slightly above 50%. As I will need to add 12 more CFs (corresponding to
another ~ 1 million variables) plus potentially other data later, it is
clear that I need better hardware (also for the retrieval part).

I am looking at Dell servers (PowerEdge etc.).

Questions:

1. Is anyone using Dell HW for their Cassandra clusters? How do they behave?
Anybody care to share their configurations or tips for buying, what to avoid
etc?

2. Obviously I am going to keep to the advice on
http://wiki.apache.org/cassandra/CassandraHardware and split the commitlog
and data onto separate disks. I was going to use an SSD for the commitlog but
then did some more research and found out that it doesn't make sense to use
SSDs for sequential appends because they won't have a performance advantage
over rotational media. So I am going to use a rotational disk for the
commit log and an SSD for data. Does this make sense?

3. What's the best way to find out how big my commitlog disk and my data
disk have to be? The Cassandra hardware page says the commitlog disk
shouldn't be big, but I still need to choose a size!

4. I also noticed RAID 0 configuration is recommended for the data file
directory. Can anyone explain why?

Sorry for the huge email.....

Cheers,
Alex
