Re: Really need some advice on large data considerations

2014-05-16 Thread Yatong Zhang
Hi Michael, thanks for the reply,

I would RAID0 all those data drives, personally, and give up managing them
 separately. They are on multiple PCIe controllers, one drive per channel,
 right?


RAID 0 is a simple way to go, but a single disk failure takes down the whole
volume, so I'm afraid RAID 0 won't be our choice.
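
What we are leaning towards instead is letting Cassandra manage the disks
individually (JBOD) by listing one data directory per drive in cassandra.yaml.
A rough sketch, with purely hypothetical mount points:

    # cassandra.yaml: one data directory per physical disk (JBOD),
    # so losing a single drive does not take down one big RAID 0 volume.
    # The mount points below are hypothetical placeholders.
    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data

    # The commit log is usually kept on its own device.
    commitlog_directory: /var/lib/cassandra/commitlog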

 I would highly suggest re-thinking how you want to set up your data
 model and re-planning your cluster appropriately,


Our data is large but our model is simple: most operations are reads by key,
and we never update the data (we only delete it periodically). Thanks to its
'dynamo' architecture, serving that much 'static' data on Cassandra is not a
problem. What concerns me is the 'dynamic' part: compactions, adding /
removing nodes, data re-balancing and things like that.
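
For our evaluation, those 'dynamic' operations map to the standard nodetool
commands, which is roughly what we plan to exercise in testing; a quick sketch
of the usual tooling:

    nodetool status            # ring overview: load per node, ownership, up/down state
    nodetool compactionstats   # pending and running compactions
    nodetool cleanup           # after adding nodes: drop data a node no longer owns
    nodetool decommission      # remove a live node, streaming its data to the others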

What we care about most is scalability and the fail-over strategy, and
Cassandra looks splendid for this: linear scalability, decentralized,
auto-partitioning, auto-recovery. That is why we chose it.


 but if you are using large blobs like image data, think about putting that
 blob data somewhere else


Any good ideas about this?
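
One approach we have been considering (just a sketch; the table and column
names are made up) is to keep the blob bytes in an external object store or
distributed file system, and keep only a small pointer row in Cassandra:

    -- Hypothetical sketch: Cassandra holds only metadata plus a pointer;
    -- the actual bytes live in an external blob store.
    CREATE TABLE media.blob_index (
        blob_id   uuid PRIMARY KEY,   -- lookup key
        size      bigint,             -- blob size in bytes
        checksum  text,               -- e.g. md5 of the blob
        location  text                -- URL / path in the external store
    );

That keeps rows small and compaction and streaming cheap, at the cost of a
second system to operate.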

The doc you mentioned on the DataStax site is great. We're still gathering
information and evaluating Cassandra, and it would be great if you have any
other suggestions!

Thanks

Best


Re: Really need some advice on large data considerations

2014-05-16 Thread DuyHai Doan
You can watch this: https://www.youtube.com/watch?v=uoggWahmWYI

 Aaron discusses support for big nodes in it.




On Wed, May 14, 2014 at 3:13 AM, Yatong Zhang bluefl...@gmail.com wrote:

 Thank you Aaron, but we're planning on about 20 TB per node; is that feasible?


 On Mon, May 12, 2014 at 4:33 PM, Aaron Morton aa...@thelastpickle.com wrote:

 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.

 If you want to get the most out of the raw disk space, LCS is the way to
 go; just remember it uses approximately twice the disk IO.

 From our experience, changing any settings/schema while a large cluster
 is online and has been running for some time is really a pain.

 Which parts in particular?

 Updating the schema or the config? OpsCenter has a rolling restart feature
 which can be handy when Chef / Puppet is deploying the config changes.
 Schema / gossip can take a little while to propagate with a high number of nodes.

 On a modern version you should be able to run 2 to 3 TB per node, maybe
 higher. The biggest concerns are going to be repair (the changes in 2.1
 will help) and bootstrapping. I'd recommend testing a smaller cluster, say
 12 nodes, with a high load per node, say 3 TB.

 cheers
 Aaron

 -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 9/05/2014, at 12:09 pm, Yatong Zhang bluefl...@gmail.com wrote:

 Hi,

 We're going to deploy a large Cassandra cluster at the PB level. Our
 scenario would be:

 1. Lots of writes, about 150 writes/second on average, and about 300 KB
 per write.
 2. Relatively very few reads.
 3. Our data is never updated.
 4. But we will delete old data periodically to free space for new data.

 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.

 We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations. Is
 this enough, and is it up to date? From our experience, changing any
 settings/schema while a large cluster is online and has been running for
 some time is really a pain. So we're gathering more information and hoping
 for some more practical suggestions before we set up the Cassandra
 cluster.

 Thanks, and any help is greatly appreciated.






Re: Really need some advice on large data considerations

2014-05-14 Thread Yatong Zhang
Thank you Aaron, but we're planning on about 20 TB per node; is that feasible?


On Mon, May 12, 2014 at 4:33 PM, Aaron Morton aa...@thelastpickle.com wrote:

 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.

 If you want to get the most out of the raw disk space, LCS is the way to
 go; just remember it uses approximately twice the disk IO.

 From our experience, changing any settings/schema while a large cluster is
 online and has been running for some time is really a pain.

 Which parts in particular?

 Updating the schema or the config? OpsCenter has a rolling restart feature
 which can be handy when Chef / Puppet is deploying the config changes.
 Schema / gossip can take a little while to propagate with a high number of nodes.

 On a modern version you should be able to run 2 to 3 TB per node, maybe
 higher. The biggest concerns are going to be repair (the changes in 2.1
 will help) and bootstrapping. I'd recommend testing a smaller cluster, say
 12 nodes, with a high load per node, say 3 TB.

 cheers
 Aaron

 -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 9/05/2014, at 12:09 pm, Yatong Zhang bluefl...@gmail.com wrote:

 Hi,

 We're going to deploy a large Cassandra cluster at the PB level. Our
 scenario would be:

 1. Lots of writes, about 150 writes/second on average, and about 300 KB
 per write.
 2. Relatively very few reads.
 3. Our data is never updated.
 4. But we will delete old data periodically to free space for new data.

 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.

 We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations. Is
 this enough, and is it up to date? From our experience, changing any
 settings/schema while a large cluster is online and has been running for
 some time is really a pain. So we're gathering more information and hoping
 for some more practical suggestions before we set up the Cassandra
 cluster.

 Thanks, and any help is greatly appreciated.





Really need some advice on large data considerations

2014-05-13 Thread Yatong Zhang
Hi,

We're going to deploy a large Cassandra cluster at the PB level. Our scenario
would be:

1. Lots of writes, about 150 writes/second on average, and about 300 KB per
write.
2. Relatively very few reads.
3. Our data is never updated.
4. But we will delete old data periodically to free space for new data.
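
In CQL terms, the kind of table we have in mind is roughly this (names are
made up; one value of about 300 KB per key, written once, read by key, never
updated):

    CREATE TABLE storage.records (
        record_id   text PRIMARY KEY,  -- lookup key
        created_at  timestamp,
        payload     blob               -- ~300 KB value, written once
    );

    -- Periodic cleanup: either delete old keys explicitly ...
    DELETE FROM storage.records WHERE record_id = 'some-old-key';
    -- ... or write each row with a TTL up front so it expires on its own:
    INSERT INTO storage.records (record_id, created_at, payload)
    VALUES ('some-key', '2014-05-01', 0xCAFE) USING TTL 2592000;  -- 30 days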

We've learned that the compaction strategy is an important point, because
we've run into 'no space' trouble with the 'size-tiered' compaction strategy.

We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations. Is
this enough, and is it up to date? From our experience, changing any
settings/schema while a large cluster is online and has been running for some
time is really a pain. So we're gathering more information and hoping for some
more practical suggestions before we set up the Cassandra cluster.

Thanks, and any help is greatly appreciated.


Re: Really need some advice on large data considerations

2014-05-12 Thread Aaron Morton
 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.
If you want to get the most out of the raw disk space, LCS is the way to go;
just remember it uses approximately twice the disk IO.
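
For reference, switching an existing table over is just a schema change,
something along these lines (keyspace and table names are placeholders; 160 MB
is a commonly used SSTable target size for LCS):

    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'LeveledCompactionStrategy',
                         'sstable_size_in_mb': 160};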

 From our experience, changing any settings/schema while a large cluster is
 online and has been running for some time is really a pain.
Which parts in particular?

Updating the schema or the config? OpsCenter has a rolling restart feature
which can be handy when Chef / Puppet is deploying the config changes. Schema /
gossip can take a little while to propagate with a high number of nodes.
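
Done by hand, a rolling restart boils down to something like the following on
each node in turn (the service name depends on how Cassandra was installed):

    nodetool drain                    # flush memtables and stop accepting writes
    sudo service cassandra restart    # service name / init system varies per install
    nodetool status                   # wait until the node is back Up/Normal (UN)
                                      # before moving on to the next node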
 
On a modern version you should be able to run 2 to 3 TB per node, maybe
higher. The biggest concerns are going to be repair (the changes in 2.1 will
help) and bootstrapping. I'd recommend testing a smaller cluster, say 12
nodes, with a high load per node, say 3 TB.
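
When testing at that load, the operations worth timing are a routine
primary-range repair and a node bootstrap; roughly (the keyspace name is a
placeholder):

    nodetool repair -pr my_keyspace   # repair only this node's primary ranges;
                                      # run it node by node around the ring
    nodetool netstats                 # watch streaming progress while a new
                                      # node bootstraps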

cheers
Aaron
 
-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 9/05/2014, at 12:09 pm, Yatong Zhang bluefl...@gmail.com wrote:

 Hi,
 
 We're going to deploy a large Cassandra cluster at the PB level. Our scenario
 would be:
 
 1. Lots of writes, about 150 writes/second on average, and about 300 KB per
 write.
 2. Relatively very few reads.
 3. Our data is never updated.
 4. But we will delete old data periodically to free space for new data.
 
 We've learned that the compaction strategy is an important point, because
 we've run into 'no space' trouble with the 'size-tiered' compaction
 strategy.
 
 We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations. Is
 this enough, and is it up to date? From our experience, changing any
 settings/schema while a large cluster is online and has been running for some
 time is really a pain. So we're gathering more information and hoping for
 some more practical suggestions before we set up the Cassandra cluster.
 
 Thanks, and any help is greatly appreciated.