Re: Really need some advice on large data considerations

Hi Michael, thanks for the reply.

> I would RAID0 all those data drives, personally, and give up managing them separately. They are on multiple PCIe controllers, one drive per channel, right?

RAID 0 is the simple way to go, but one disk failure takes down the whole volume, so I'm afraid RAID 0 won't be our choice.

> I would highly suggest re-thinking how you want to set up your data model and re-planning your cluster appropriately.

Our data is large but our model is simple: most operations are reads by key, and we never update the data (we only delete it periodically). Given Cassandra's Dynamo-style architecture, serving that much 'static' data is not a problem. What concerns me is the 'dynamic' part: compactions, adding/removing nodes, data rebalancing, and so on.

What we care about most is scalability and the fail-over strategy, and Cassandra looks splendid for this: linear scalability, decentralized, auto-partitioning, auto-recovery. That's why we chose it.

> But if you are using large blobs like image data, think about putting that blob data somewhere else.

Any good ideas about this? The doc you mentioned on the DataStax site is great. We're still gathering information and evaluating Cassandra, and it would be great if you have any other suggestions!

Thanks
Best
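P.S. On the blob question, what we had in mind is keeping only a small pointer row in Cassandra and the bytes themselves in a separate file/object store. A rough sketch; the keyspace, table, and column names below are just placeholders:

    CREATE TABLE media.blob_index (
        blob_id    text PRIMARY KEY,  -- the key we read by
        store_url  text,              -- where the actual bytes live (e.g. an HDFS path)
        size_bytes bigint,
        created_at timestamp
    );

That keeps reads a single lookup by key, and our periodic deletes only touch small index rows plus the external files. Is that roughly what you were suggesting?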
Re: Really need some advice on large data considerations

You can watch this: https://www.youtube.com/watch?v=uoggWahmWYI

Aaron discusses support for big nodes in it.

On Wed, May 14, 2014 at 3:13 AM, Yatong Zhang bluefl...@gmail.com wrote:
> Thank you Aaron, but we're planning on about 20TB per node; is that feasible?
Re: Really need some advice on large data considerations

Thank you Aaron, but we're planning on about 20TB per node; is that feasible?

On Mon, May 12, 2014 at 4:33 PM, Aaron Morton aa...@thelastpickle.com wrote:
> On a modern version you should be able to run 2 to 3 TB per node, maybe higher. The biggest concerns are going to be repair (the changes in 2.1 will help) and bootstrapping.
Really need some advice on large data considerations

Hi,

We're going to deploy a large Cassandra cluster at the PB level. Our scenario:

1. Lots of writes: about 150 writes/second on average, at about 300 KB per write.
2. Relatively very few reads.
3. Our data is never updated.
4. But we will delete old data periodically to free space for new data.

We've learned that the compaction strategy is an important point, because we've run into 'no space' trouble with the size-tiered compaction strategy. We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations — is this enough, and is it up to date?

From our experience, changing any settings/schema while a large cluster is online and has been running for some time is really, really a pain. So we're gathering more info and hoping for more practical suggestions before we set up the Cassandra cluster.

Thanks, and any help is greatly appreciated.
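P.S. The rough math behind 'PB level', based on the averages above (a replication factor of 3 is just an assumption here, we haven't fixed it yet):

    150 writes/s x 300 KB   ~ 45 MB/s incoming
    45 MB/s x 86,400 s/day  ~ 3.9 TB/day before replication
    at RF=3                 ~ 11.7 TB/day on disk, i.e. about 1 PB every ~85 days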
Re: Really need some advice on large data considerations

> We've learned that the compaction strategy is an important point, because we've run into 'no space' trouble with the size-tiered compaction strategy.

If you want to get the most out of the raw disk space, LCS is the way to go; just remember it uses approximately twice the disk IO.

> From our experience, changing any settings/schema while a large cluster is online and has been running for some time is really, really a pain.

Which parts in particular? Updating the schema, or the config? OpsCenter has a rolling-restart feature which can be handy when Chef/Puppet is deploying the config changes. Schema/gossip changes can take a little while to propagate with a high number of nodes.

On a modern version you should be able to run 2 to 3 TB per node, maybe higher. The biggest concerns are going to be repair (the changes in 2.1 will help) and bootstrapping.

I'd recommend testing a smaller cluster, say 12 nodes, with a high load per node (3TB).

cheers
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
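P.S. For reference, moving an existing table to LCS is a single schema change. A sketch with placeholder keyspace/table names (the SSTable size shown is just the current default; tune to taste):

    ALTER TABLE myks.mytable
      WITH compaction = {'class': 'LeveledCompactionStrategy',
                         'sstable_size_in_mb': '160'};

Expect a burst of compaction activity afterwards while the existing SSTables are re-levelled, so roll it out one table at a time and keep an eye on nodetool compactionstats.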