[Please feel free to correct me on anything or suggest other workarounds
that could be employed now to help.]
Hello,
This is purely theoretical, as I don't have a big working cluster yet
and am still in the planning stages, but from what I understand, while
Cass scales well horizontally, EACH node does not handle a data store
in the terabyte range well... for understandable reasons such as simple
hardware and bandwidth limitations. But, looking forward and pushing
the envelope, I think there might be ways to at least manage these
issues until broadband speeds and disk and memory technology catch up.
The biggest issues with big data clusters that I am currently aware of are:
> disk I/O problems during major compactions and repairs.
> bandwidth limitations while commissioning new nodes.
Here are a few ideas I've thought of:
1.) Load-balancing:
During a major compaction, repair, or other similarly severe
performance-impacting process, allow the node to broadcast that it is
temporarily unavailable so requests for data can be sent to other nodes
in the cluster. The node could still "wake up" and pause or cancel its
compaction in the case of a failed node, whereby there are no other
nodes that can provide the requested data. The node would be considered
"degraded" by other nodes, rather than down. (As a matter of fact, a
general load-balancing scheme could be devised if each node broadcast
its current load level and maybe even the hop count between data centers.)
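To make the routing idea concrete, here is a minimal sketch of how a coordinator could pick a replica using those broadcast load levels and "degraded" flags. All of the names, the load values, and the dict shape are made up for illustration; this is not an existing Cassandra API, just the decision rule described above:

```python
def pick_replica(replicas):
    """Prefer healthy replicas, ordered by advertised load; fall back
    to a "degraded" replica only if no healthy one holds the data (a
    degraded, compacting node still beats an unavailable one)."""
    healthy = [r for r in replicas if not r["degraded"]]
    pool = healthy or replicas  # degraded nodes are the last resort
    if not pool:
        raise RuntimeError("no replica available for this key")
    return min(pool, key=lambda r: r["load"])

replicas = [
    {"name": "node-a", "load": 0.80, "degraded": False},
    {"name": "node-b", "load": 0.20, "degraded": True},  # mid-compaction
    {"name": "node-c", "load": 0.55, "degraded": False},
]
print(pick_replica(replicas)["name"])  # prints "node-c"
```

Note that node-b has the lowest load but is skipped because it advertised itself as degraded; it would only be chosen if node-a and node-c were both gone.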
2.) Data Transplants:
Since commissioning a new node that is due to receive data in the TB
range could take days or weeks of data transfer, it would be much more
efficient to just courier the data. Perhaps the SSTables (maybe from a
snapshot) could be transplanted from one production node into a new node
to help jump-start the bootstrap process. The new node could sort
things out during the bootstrapping phase so that it ends up balanced
correctly, as if it had started out with no data as usual. If this
could cut the bandwidth in half, that would be a great benefit.
However, this would work well mostly if the transplanted data came from
a keyspace that used a random partitioner; coming from an ordered
partitioner may not be so helpful if the rows in the transplanted data
would never be used on the new node.
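The "sorting things out" step is essentially token-range filtering: with the random partitioner (which tokenizes keys with MD5), the new node would keep only the transplanted rows whose tokens fall in its assigned arc of the ring and drop the rest, much like `nodetool cleanup` drops rows a node no longer owns. A sketch, with the key set and range endpoints invented for illustration:

```python
import hashlib

RING = 2 ** 127  # RandomPartitioner's MD5 token space

def token(key: bytes) -> int:
    # MD5-based token, as the random partitioner computes it
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING

def keep_for_range(keys, start, end):
    """Rows the new node keeps from transplanted SSTables: those whose
    token lands in its (start, end] arc of the ring."""
    def in_range(t):
        if start < end:
            return start < t <= end
        return t > start or t <= end  # the arc wraps past token 0
    return [k for k in keys if in_range(token(k))]

keys = [b"alice", b"bob", b"carol", b"dave"]
mid = RING // 2
print(keep_for_range(keys, 0, mid))  # whichever keys hash into the lower half
```

Splitting the ring at any point partitions the transplanted rows cleanly between the donor's range and the new node's range, which is why this works for a random partitioner but not for an ordered one (there the donor's rows may all lie outside the new node's range).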
3.) Strategic Partitioning:
Of course, there are surely other issues to contend with, such as RAM
requirements for caching purposes. That may be managed by a partition
strategy that allows certain nodes to specialize in a certain subset of
the data, such as geographically or whatever the designer chooses.
Replication would still be done as usual but this may help the cache to
be better utilized by allowing it to focus on the subset of data that
comprises the majority of the node's data versus a random sampling of
the entire cluster. IOW, while a node may specialize in a certain
subset and also contain replicated rows from outside that subset, it
will still only (mostly) be queried for data from within its subset,
and thus the cache will contain mostly data from this special subset,
which could increase the hit rate of the cache.
This may not be a huge help for TB-sized data nodes, since even 32 GB
of RAM would still be relatively tiny in comparison to the data size,
but I include it just in case it spurs other ideas. Also, I do not know
how Cass decides on which node to query for data in the first place...
maybe not the best idea.
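A back-of-envelope version of the cache argument, with all numbers invented for illustration: under a crude uniform-access model, the hit rate is roughly the fraction of the queried working set that fits in cache, so shrinking the working set a node is actually queried for raises the hit rate proportionally.

```python
def est_hit_rate(cache_gb, working_set_gb):
    """Crude estimate assuming uniform access: the fraction of the
    working set that fits in cache. Real workloads with skewed access
    do better, but the comparison between the two cases still holds."""
    return min(1.0, cache_gb / working_set_gb)

# 32 GB of cache on a node holding 2 TB:
generalist = est_hit_rate(32, 2000)  # queried uniformly across all 2 TB
specialist = est_hit_rate(32, 200)   # queries focus on a 200 GB hot subset
print(generalist, specialist)        # ~1.6% vs ~16%
```

A 10x smaller effective working set gives a 10x better hit rate in this toy model, though as noted above, both numbers stay small when the node holds terabytes.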
4.) Compressed Columns:
Some sort of data compression of certain columns could be very helpful
especially since text can be compressed to less than 50% if the
conditions are right. Overall native disk compression will not help the
bandwidth issue since the data would be decompressed before transit. If
the data was stored compressed, then Cass could even send the data to
the client compressed so as to offload the decompression to the client.
Likewise, during node commissioning, the data would never have to be
decompressed, saving on CPU and BW. Alternatively, a client could tell
Cass to decompress the data before transmit if needed. This, combined
with idea #2 (transplants), could help speed up new-node bootstrapping,
but only when a large portion of the data consists of very large column
values and thus compression is practical and efficient. Of course, the
client could handle all the compression today without Cass even knowing
about it, so building this into Cass would be just a convenience, but
still nice to have, nonetheless.
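As a sanity check on the "less than 50%" claim, here is the client-side version that works today, using zlib as a stand-in for whatever codec would actually be chosen. The sample value is invented; the point is that repetitive, text-like column values compress far below half size, while already-random binary data would not:

```python
import zlib

# A repetitive, text-like column value (the favourable case):
value = b"2011-03-14T09:26:53Z INFO user=12345 action=login ok\n" * 100

compressed = zlib.compress(value)
ratio = len(compressed) / len(value)
print(f"compressed to {ratio:.1%} of original size")
```

Storing `compressed` as the column value means the server, the inter-node streams, and the wire to the client all carry the small form, with decompression deferred to whichever client reads it back.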
5.) Postponed Major Compactions:
The option to postpone auto-triggered major compactions until a
pre-defined time of day or week or until staff can do it manually.
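The policy itself is tiny; a hypothetical sketch, where the 02:00-06:00 window is an invented example of the "pre-defined time of day":

```python
from datetime import datetime

def may_run_major_compaction(now, window=(2, 6)):
    """Hypothetical policy: auto-triggered major compactions may only
    start between 02:00 and 06:00 local time; outside that window the
    trigger is deferred (or left for staff to run by hand)."""
    start, end = window
    return start <= now.hour < end

print(may_run_major_compaction(datetime(2011, 3, 14, 3, 0)))   # True
print(may_run_major_compaction(datetime(2011, 3, 14, 14, 0)))  # False
```

Combined with idea #1, a node entering the window could also broadcast "degraded" before it starts, so the two mechanisms reinforce each other.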
6.?) Finally, some have suggested just using more nodes with less data
storage each, which may solve most if not all of these problems. But
I'm still fuzzy on that. The trade-offs would be more infrastructure
and maintenance costs, a higher chance that some server will fail...
maybe higher bandwidth between nodes due to a larger cluster? I need
more clarity on this alternative. Imagine a total data size of 100 TB
and the choice between 200 nodes or 50. What is the cost of more
nodes, all things being equal?
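To put rough numbers on that 100 TB / 200-vs-50 question: per-node data, the time to re-stream one node's data, and expected failures per year. The 1 Gbit/s usable NIC bandwidth and 3% annual failure rate are assumptions I've picked purely for illustration:

```python
TOTAL_TB = 100.0
NIC_GBIT = 1.0  # assumed usable streaming bandwidth per node (Gbit/s)
AFR = 0.03      # assumed annual failure rate per server (illustrative)

def per_node_tb(nodes):
    return TOTAL_TB / nodes

def restream_hours(nodes):
    # hours to re-stream one node's data over the assumed NIC
    gbit = per_node_tb(nodes) * 8000  # 1 TB = 8000 Gbit (decimal units)
    return gbit / NIC_GBIT / 3600

def failures_per_year(nodes):
    return nodes * AFR

for n in (50, 200):
    print(f"{n} nodes: {per_node_tb(n)} TB/node, "
          f"~{restream_hours(n):.1f} h to restore one node, "
          f"~{failures_per_year(n):.1f} failures/yr expected")
```

Under these assumptions, the 200-node cluster fails four times as often but each failure is four times cheaper to recover from (0.5 TB vs 2 TB to re-stream), so the total re-streamed bytes per year come out the same; what actually differs is the per-incident recovery window versus the hardware and operations overhead of 150 extra boxes. I'd still welcome real-world numbers here.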
Please contribute additional ideas and strategies/patterns for the
benefit of all!
Thanks for listening and keep up the good work guys!