Thanks for the responses! We'll definitely go for powerful servers to reduce the total count. Beyond a dozen or so servers there doesn't seem to be much point in increasing the count further for replication/redundancy. I'm assuming we'll use leveled compaction, which means we'll most likely run out of CPU before we run out of I/O; at least that has been my experience so far. I'm glad to hear that 100+ nodes isn't that unusual anymore in the Cassandra world.
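To make the redundancy point concrete, here is a minimal sketch (not Cassandra's actual internals; the ring layout and function names are invented for illustration) of replica placement on a consistent-hash ring. However large the cluster, each key still lives on exactly RF nodes, so adding servers past that point buys capacity and load spreading, not extra redundancy per key:

```python
# Illustrative sketch only: a toy consistent-hash ring with RF replicas.
# Real Cassandra uses tokens, partitioners, and replication strategies;
# this just shows that RF, not cluster size, sets the redundancy per key.
import hashlib
from bisect import bisect_right

def ring_positions(nodes):
    """Map each node name to a position on a 0..2^32 ring."""
    return sorted(
        (int(hashlib.md5(n.encode()).hexdigest(), 16) % 2**32, n)
        for n in nodes
    )

def replicas_for(key, ring, rf):
    """Walk clockwise from the key's token, taking the next rf nodes."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32
    idx = bisect_right([t for t, _ in ring], token)
    return [ring[(idx + i) % len(ring)][1] for i in range(rf)]

# A 12-node cluster with RF=3: every key still maps to exactly 3 owners.
ring = ring_positions([f"node{i}" for i in range(12)])
owners = replicas_for("some-partition-key", ring, rf=3)
print(len(owners))  # 3, regardless of how many nodes are in the ring
```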
On 1/21/2012 3:38 AM, Eric Czech wrote:
> I'd also add that one of the biggest complications to arise from having
> multiple clusters is that read-biased client applications would need to
> be aware of all clusters and either aggregate result sets or involve
> logic to choose the right cluster based on a particular query.
>
> And from a more operational perspective, I think you'd have a tough time
> finding monitoring applications (like OpsCenter) that would support
> multiple clusters within the same viewport. Having used multiple
> clusters in the past, I can definitely tell you that from an
> administrative, operational, and development standpoint, one cluster is
> almost definitely better than many.
>
> Oh, and I'm positive that there are other Cassandra deployments out
> there with well beyond 100 nodes, so I don't think you're really
> treading on dangerous ground here.
>
> I'd definitely say that you should try to use a single cluster if
> possible.
>
> On Fri, Jan 20, 2012 at 9:34 PM, Maxim Potekhin <potek...@bnl.gov
> <mailto:potek...@bnl.gov>> wrote:
>
>> You can also scale not "horizontally" but "diagonally", i.e. RAID SSDs
>> and have multicore CPUs. This means that you'll have the same
>> performance with fewer nodes, making them far easier to manage.
>>
>> SSDs by themselves will give you an order of magnitude improvement on
>> I/O.
>>
>> On 1/19/2012 9:17 PM, Thorsten von Eicken wrote:
>>
>>> We're embarking on a project where we estimate we will need on the
>>> order of 100 Cassandra nodes. The data set is perfectly partitionable,
>>> meaning we have no queries that need access to all the data at once.
>>> We expect to run with RF=2 or =3. Is there some notion of an ideal
>>> cluster size? Or, asked differently, would it be easier to run one
>>> large cluster or a bunch of, say, 16-node clusters? Everything we've
>>> done to date has fit into 4-5 node clusters.
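Eric's point about read-biased clients can be sketched briefly. With several independent clusters, every read path must either pick the right cluster for a key or fan out to all of them and merge the results; with one cluster, the database's own partitioner does this for you. The names below (ClusterClient, route_read, scatter_read) are invented for illustration, not a real driver API:

```python
# Illustrative sketch of the client-side logic multiple clusters force
# on you: key-based routing, or scatter/gather across every cluster.
import zlib

class ClusterClient:
    """Stand-in for one per-cluster session (e.g. a driver connection)."""
    def __init__(self, name, data):
        self.name = name
        self._data = data  # pretend storage for the sketch

    def get(self, key):
        return self._data.get(key)

def route_read(key, clusters):
    """Pick one cluster by a stable hash of the key. Only works if
    writes were routed the same way -- logic a single cluster's
    partitioner would have given us for free."""
    idx = zlib.crc32(key.encode()) % len(clusters)
    return clusters[idx].get(key)

def scatter_read(key, clusters):
    """Alternatively, fan out to every cluster and merge the hits."""
    results = []
    for c in clusters:
        value = c.get(key)
        if value is not None:
            results.append(value)
    return results

clusters = [ClusterClient("east", {"user:1": "alice"}),
            ClusterClient("west", {"user:2": "bob"})]
print(scatter_read("user:2", clusters))  # ['bob']
```

Either way, the aggregation and routing burden lives in every client application, which is the operational argument for a single large cluster.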