Re: Data model for streaming a large table in real time.
You do not need RAID0 for data. Let C* do the striping over the data disks. And maybe CL ANY/ONE might be sufficient for your writes.

On 08.06.2014 at 06:15, Kevin Burton bur...@spinn3r.com wrote: We're using containers for other reasons, not just Cassandra. Tightly constraining resources means we don't have to worry about Cassandra, the JVM, or Linux doing something silly, using too many resources, and taking down the whole box.

On Sat, Jun 7, 2014 at 8:25 PM, Colin Clark co...@clark.ws wrote: You won't need containers - running one instance of Cassandra in that configuration will hum along quite nicely and will make use of the cores and memory. I'd forget the RAID anyway and just mount the disks separately (JBOD). -- Colin 320-221-9531

On Jun 7, 2014, at 10:02 PM, Kevin Burton bur...@spinn3r.com wrote: Right now I'm just putting everything together as a proof of concept… so just two cheap replicas for now, and it's at 1/1th of the load. If we lose data it's OK :) I think our config will be 2-3x 400GB SSDs in RAID0, 3 replicas, 16 cores, and probably 48-64GB of RAM per box. Just one datacenter for now… We're probably going to migrate to Linux containers at some point. That way we can have 16GB, one 400GB SSD, and 4 cores per image, and we can ditch the RAID, which is nice. :)

On Sat, Jun 7, 2014 at 7:51 PM, Colin colpcl...@gmail.com wrote: To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3. Try to have at least 8 cores, 32 GB of RAM, and separate disks for log and data. Will you be replicating data across data centers? -- Colin 320-221-9531

On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote: Oh… to start with we're going to use 2-10 nodes. I think we're going to take the original strategy and just use 100 buckets… 0-99… with the timestamp under that. I think it should be fine and won't require an ordered partitioner. :) Thanks!
On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote: With 100 nodes, that ingestion rate is actually quite low, and I don't think you'd need another column in the partition key. You seem to be set in your current direction. Let us know how it works out. -- Colin 320-221-9531

On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote: What's 'source'? You mean like the URL? If source is too random, it's going to yield too many buckets. Ingestion rates are fairly high but not insane: about 4M inserts per hour, from 5-10GB…

On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote: Not if you add another column to the partition key; source, for example. I would really try to stay away from the ordered partitioner if at all possible. What ingestion rates are you expecting, in size and speed? -- Colin 320-221-9531

On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote: Thanks for the feedback on this, btw… it's helpful. My notes below.

On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote: No, you're not - the partition key will get distributed across the cluster if you're using random or murmur.

Yes… I'm aware. But in practice this is how it will work… If we create bucket b0, it will get hashed to h0… Say I have 50 machines performing writes; they are all on the same time thanks to ntpd, so they all compute b0 for the current bucket based on the time. That gets hashed to h0… If h0 is hosted on node0, then all writes go to node zero for that 1-second interval. So all my writes are bottlenecking on one node. That node is *changing* over time… but the writes are not being dispatched in parallel over N nodes. At most, writes will only ever reach 1 node at a time.

You could also ensure distribution by adding another column, like source.
(Add the seconds to the partition key, not the clustering columns.) I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, it will bite you later.

Sure… I'm trying to avoid the 'bite you later' issues, more so because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines :-P

In fact, the use case that you're describing may best be served by a queuing mechanism, using Cassandra only for the underlying store.

Yes… that's what I'm doing. We're using Apollo to fan out the queue, but the writes go back into Cassandra and need to be read out sequentially.

I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought the ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.
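The hot-spot argument above (50 ntpd-synced writers all computing the same time bucket) can be sketched with a toy simulation. The node count and hashing here are stand-ins, not Cassandra's actual Murmur3 token assignment:

```python
import hashlib

NUM_NODES = 20

def node_for(partition_key: str) -> int:
    # Toy stand-in for token-aware placement: hash the partition key and
    # map it onto one of NUM_NODES token ranges (not real Murmur3).
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# 50 ntpd-synced writers all compute the same 1-second bucket.
bucket = "b-1402210515"

# Partition key = time bucket only: every writer lands on the same node.
hot = {node_for(bucket) for _ in range(50)}

# Partition key = (time bucket, source): writers spread over the ring.
spread = {node_for(f"{bucket}:{src}") for src in range(50)}

print(len(hot), len(spread))
```

Whatever the hash, the bucket-only key can only ever map to one node per interval, while adding a source column fans the same second's writes across the ring.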
Re: Data model for streaming a large table in real time.
Here’s the Jira for the proposal to remove BOP (and OPP), but you can see that there is no clear consensus and that the issue is still open:

CASSANDRA-6922 - Investigate if we can drop ByteOrderedPartitioner and OrderPreservingPartitioner in 3.0
https://issues.apache.org/jira/browse/CASSANDRA-6922

You can read the DataStax Cassandra doc for why “Using an ordered partitioner is not recommended”: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerBOP_c.html - “Difficult load balancing... Sequential writes can cause hot spots... Uneven load balancing for multiple tables”

-- Jack Krupansky

From: Kevin Burton
Sent: Saturday, June 7, 2014 1:27 PM
To: user@cassandra.apache.org
Subject: Re: Data model for streaming a large table in real time.

I just checked the source, and in 2.1.0 it's not deprecated. So it *might* be *being* deprecated, but I haven't seen anything stating that.

On Sat, Jun 7, 2014 at 8:03 AM, Colin colpcl...@gmail.com wrote: I believe ByteOrderedPartitioner is being deprecated, and for good reason. I would look at what you could achieve by using wide rows and Murmur3Partitioner. -- Colin 320-221-9531

On Jun 6, 2014, at 5:27 PM, Kevin Burton bur...@spinn3r.com wrote: We have the requirement to have clients read from our tables while they're being written. Basically, any write that we make to Cassandra needs to be sent out over the Internet to our customers. We also need them to be able to resume, so if they go offline they can just pick up where they left off. They need to do this in parallel, so if we have 20 Cassandra nodes, they can have 20 readers, each efficiently (and without coordination) reading from our tables. Here's how we're planning on doing it. We're going to use the ByteOrderedPartitioner. I'm writing with a primary key of the timestamp; however, in practice, this would yield hotspots. (I'm also aware that time isn't a very good pk in a distributed system, since I can easily have a collision, so we're going to use a scheme similar to a uuid to make it unique per writer.) One node would take all the load, followed by the next node, etc. So my plan to stop this is to prefix a slice ID to the timestamp. This way each piece of content has a unique ID, but the prefix will place it on a node. The slice ID is just a byte… so this means there are 255 buckets in which I can place data. This means I can have clients each start with a slice and a timestamp and page through the data with tokens. This way I can have a client reading with 255 threads from 255 regions in the cluster, in parallel, without any hot spots. Thoughts on this strategy?

-- Founder/CEO Spinn3r.com Location: San Francisco, CA Skype: burtonator blog: http://burtonator.wordpress.com … or check out my Google+ profile
War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
Re: Data model for streaming a large table in real time.
Hey Jack. Thanks for posting this… very helpful. So I guess the status is that it was proposed for deprecation, but that proposal didn't reach consensus. Also, this gave me an idea to look at the JIRA to see what's being proposed for 3.0 :)

Kevin

On Sun, Jun 8, 2014 at 1:26 PM, Jack Krupansky j...@basetechnology.com wrote: Here’s the Jira for the proposal to remove BOP (and OPP), but you can see that there is no clear consensus and that the issue is still open: CASSANDRA-6922 https://issues.apache.org/jira/browse/CASSANDRA-6922
Re: Data model for streaming a large table in real time.
I believe ByteOrderedPartitioner is being deprecated, and for good reason. I would look at what you could achieve by using wide rows and Murmur3Partitioner.

-- Colin 320-221-9531

On Jun 6, 2014, at 5:27 PM, Kevin Burton bur...@spinn3r.com wrote: We have the requirement to have clients read from our tables while they're being written. Basically, any write that we make to Cassandra needs to be sent out over the Internet to our customers.
Re: Data model for streaming a large table in real time.
"One node would take all the load, followed by the next node" - with this design, you are not exploiting all the power of the cluster. If only one node takes all the load at a time, what is the point of having 10 or 20 nodes? You'd be better off using limited wide rows with bucketing to achieve this. You can have a look at this past thread; it may give you some ideas: https://www.mail-archive.com/user@cassandra.apache.org/msg35666.html

On Sat, Jun 7, 2014 at 12:27 AM, Kevin Burton bur...@spinn3r.com wrote: We have the requirement to have clients read from our tables while they're being written. Basically, any write that we make to Cassandra needs to be sent out over the Internet to our customers.
Re: Data model for streaming a large table in real time.
I just checked the source, and in 2.1.0 it's not deprecated. So it *might* be *being* deprecated, but I haven't seen anything stating that.

On Sat, Jun 7, 2014 at 8:03 AM, Colin colpcl...@gmail.com wrote: I believe ByteOrderedPartitioner is being deprecated, and for good reason. I would look at what you could achieve by using wide rows and Murmur3Partitioner. -- Colin 320-221-9531
Re: Data model for streaming a large table in real time.
It's an anti-pattern, and there are better ways to do this. I have implemented the paging algorithm you've described using wide rows and bucketing. This approach is a more efficient utilization of Cassandra's built-in wholesome goodness.

Also, I wouldn't let any (huge) number of clients connect directly to the cluster to do this - put some type of app server in between to handle the comms and fan-out. You'll get better utilization of resources and less overhead, in addition to flexibility in which data center you're utilizing to serve requests.

-- Colin 320-221-9531

On Jun 7, 2014, at 12:28 PM, Kevin Burton bur...@spinn3r.com wrote: I just checked the source, and in 2.1.0 it's not deprecated. So it *might* be *being* deprecated, but I haven't seen anything stating that.
Re: Data model for streaming a large table in real time.
On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark co...@clark.ws wrote: It's an anti-pattern and there are better ways to do this.

Entirely possible :) It would be nice to have a document with a bunch of common Cassandra design patterns. I've been trying to track down a pattern for this, and a lot of it is pieced together across individual blog posts, so one has to reverse-engineer it.

I have implemented the paging algorithm you've described using wide rows and bucketing. This approach is a more efficient utilization of Cassandra's built-in wholesome goodness.

So… I assume the general pattern is: create a bucket - you create, say, 2^16 buckets, and this is your partition key. Then you place a timestamp next to the bucket in the primary key. So essentially: primary key( bucket, timestamp )… To read from this bucket, you essentially execute: select * from foo where bucket = 100 and timestamp > 12345790 limit 1;

Also, I wouldn't let any number of clients (huge) connect directly to the cluster to do this - put some type of app server in between to handle the comms and fan-out. You'll get better utilization of resources and less overhead, in addition to flexibility in which data center you're utilizing to serve requests.

This is interesting… since the partition is the bucket, you could make some poor decisions based on the number of buckets. For example, if you use 2^64 buckets, the number of items in each bucket is going to be rather small, so you're going to have tons of queries each fetching 0-1 rows (if you have a small amount of data). But if you use very FEW buckets… say 5, but you have a cluster of 1000 nodes, then you will have these 5 buckets on 5 nodes and the rest of the nodes without any data. Hm…

The byte-ordered partitioner solves this problem because I can just pick a fixed number of buckets; that becomes the primary key prefix, and the data in a bucket can be split up across machines on any arbitrary split, even in the middle of a 'bucket'…
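The primary key( bucket, timestamp ) paging scheme under discussion can be sketched in plain Python. Table contents and names here are hypothetical, and an in-memory dict of sorted lists stands in for the cluster; the point is the resume-by-timestamp cursor:

```python
from bisect import bisect_right

NUM_BUCKETS = 100

def bucket_for(event_id: int) -> int:
    # Assign each event to one of a fixed set of buckets (partition key).
    return event_id % NUM_BUCKETS

# rows[bucket] is kept sorted by timestamp, mimicking a clustering
# column ordered ascending within each partition.
rows = {b: [] for b in range(NUM_BUCKETS)}
for ts in range(1000):
    rows[bucket_for(ts)].append(ts)

def page(bucket: int, after_ts: int, limit: int = 10):
    """Equivalent of: SELECT * FROM feed WHERE bucket = ? AND timestamp > ? LIMIT ?"""
    data = rows[bucket]
    start = bisect_right(data, after_ts)
    return data[start:start + limit]

# A client assigned bucket 7 pages forward, resuming from its last timestamp.
first = page(7, after_ts=-1)
nxt = page(7, after_ts=first[-1])
```

Each client owns one bucket and only ever needs to remember its last-seen timestamp to resume after going offline, which matches the "pick up where they left off" requirement.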
Re: Data model for streaming a large table in real time.
Another way around this is to have a separate table storing the number of buckets. This way, if you have too few buckets, you can just increase them in the future. Of course, the older data will still have too few buckets :-(

On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton bur...@spinn3r.com wrote: So… I assume the general pattern is: create a bucket - you create, say, 2^16 buckets, and this is your partition key. Then you place a timestamp next to the bucket in the primary key.
Re: Data model for streaming a large table in real time.
Maybe it makes sense to describe what you're trying to accomplish in more detail.

A common bucketing approach is along the lines of year, month, day, hour, minute, etc., and then using a timeuuid as a clustering column.

Depending upon the semantics of the transport protocol you plan on utilizing, either the client code keeps track of pagination, or the app server could, if you utilized some type of request/reply/ack flow. You could keep sequence numbers for each client and begin streaming data to them, or allow querying upon reconnect, etc. But again, more details of the use case might prove useful.

-- Colin 320-221-9531

On Jun 7, 2014, at 1:53 PM, Kevin Burton bur...@spinn3r.com wrote: Another way around this is to have a separate table storing the number of buckets. This way, if you have too few buckets, you can just increase them in the future.
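The year/month/day/hour/minute bucketing with a timeuuid clustering column might look like this in outline. This is a Python sketch, not CQL; the layout is an assumption, and uuid1 stands in for timeuuid since both embed a timestamp:

```python
import uuid
from datetime import datetime, timezone

def partition_for(dt: datetime) -> tuple:
    # Partition key: one bucket per minute.
    return (dt.year, dt.month, dt.day, dt.hour, dt.minute)

def make_key(dt: datetime):
    # uuid1 embeds a 100ns timestamp plus node/clock-sequence bits, the
    # same idea as CQL's timeuuid: rows within the minute stay uniquely
    # ordered even when two writers collide on the same instant.
    return partition_for(dt), uuid.uuid1()

dt = datetime(2014, 6, 7, 21, 30, 15, tzinfo=timezone.utc)
part, cluster_col = make_key(dt)
print(part)  # (2014, 6, 7, 21, 30)
```

This is what prompts Kevin's objection below: every write in a given minute shares one partition key, so that minute's traffic concentrates on one replica set.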
Re: Data model for streaming a large table in real time.
On Sat, Jun 7, 2014 at 1:34 PM, Colin colpcl...@gmail.com wrote: Maybe it makes sense to describe what you're trying to accomplish in more detail.

Essentially, I'm appending writes of recent data from our crawler and sending that data to our customers. They need to sync to up-to-date writes… we need to get them writes within seconds.

A common bucketing approach is along the lines of year, month, day, hour, minute, etc., and then using a timeuuid as a clustering column.

I mean, that is acceptable… but it means that for that 1-minute interval, all writes are going to that one node (and its replicas). So the total cluster throughput is bottlenecked on the max disk throughput of one node. Same thing for reads… unless our customers are lagged, they are all going to stampede, and ALL of them are going to read data from one node in a one-minute timeframe. That's no fun… that will easily DoS our cluster.

Depending upon the semantics of the transport protocol you plan on utilizing, either the client code keeps track of pagination, or the app server could, if you utilized some type of request/reply/ack flow. You could keep sequence numbers for each client and begin streaming data to them, or allow querying upon reconnect, etc. But again, more details of the use case might prove useful.

I think if we were to use just 100 buckets, it would probably work just fine. We're probably not going to be more than 100 nodes in the next year, and even if we are, that's still reasonable performance. I mean, if each box has a 400GB SSD, that's 40TB of VERY fast data.

Kevin
Re: Data model for streaming a large table in real time.
Then add seconds to the bucket. Also, the data will get cached; it's not going to hit disk on every read. Look at the key cache settings on the table. Also, in 2.1 you have even more control over caching.

-- Colin 320-221-9531

On Jun 7, 2014, at 4:30 PM, Kevin Burton bur...@spinn3r.com wrote: [...]
Re: Data model for streaming a large table in real time.
well, you could add milliseconds; at best you're still bottlenecking most of your writes on one box.. maybe 2-3 if there are ones that are lagging. Anyway.. I think using 100 buckets is probably fine..

Kevin

On Sat, Jun 7, 2014 at 2:45 PM, Colin colpcl...@gmail.com wrote: Then add seconds to the bucket. Also, the data will get cached; it's not going to hit disk on every read. [...]
Re: Data model for streaming a large table in real time.
No, you're not; the partition key will get distributed across the cluster if you're using random or murmur. You could also ensure that by adding another column, like source, to ensure distribution. (Add the seconds to the partition key, not the clustering columns.)

I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, it will bite you later. In fact, the use case that you're describing may best be served by a queuing mechanism, using Cassandra only for the underlying store. I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought the ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.

Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't. :)

-- Colin

On Jun 7, 2014, at 6:53 PM, Kevin Burton bur...@spinn3r.com wrote: well, you could add milliseconds; at best you're still bottlenecking most of your writes on one box.. [...]
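The suggestion of adding another column such as source to the partition key can be sketched like this; NUM_SHARDS and the key layout are hypothetical:

```python
import random
from datetime import datetime, timezone

NUM_SHARDS = 16  # hypothetical; sized to the write parallelism you want

def partition_key(ts: datetime) -> tuple:
    """Compound partition key: (second-level time bucket, shard).
    The extra shard column spreads a single second's writes across
    NUM_SHARDS partitions instead of concentrating them in one."""
    return (ts.strftime("%Y%m%d%H%M%S"), random.randrange(NUM_SHARDS))

now = datetime(2014, 6, 7, 19, 14, 0, tzinfo=timezone.utc)
keys = {partition_key(now) for _ in range(1000)}

# One second of writes now maps to NUM_SHARDS distinct partitions.
assert len(keys) == NUM_SHARDS
```

The cost is on the read side: a consumer must issue NUM_SHARDS queries per time slice and merge the results by the clustering column.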
Re: Data model for streaming a large table in real time.
Thanks for the feedback on this btw... it's helpful. My notes below.

On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

No, you're not; the partition key will get distributed across the cluster if you're using random or murmur.

Yes… I'm aware. But in practice this is how it will work… If we create bucket b0, that will get hashed to h0… So say I have 50 machines performing writes; they are all on the same time thanks to ntpd, so they all compute b0 for the current bucket based on the time. That gets hashed to h0… If h0 is hosted on node0, then all writes go to node zero for that 1-second interval. So all my writes are bottlenecking on one node. That node is *changing* over time… but they're not being dispatched in parallel over N nodes. At most, writes will only ever reach 1 node at a time.

You could also ensure that by adding another column, like source, to ensure distribution. (Add the seconds to the partition key, not the clustering columns.) I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, it will bite you later.

Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines :-P

In fact, the use case that you're describing may best be served by a queuing mechanism, using Cassandra only for the underlying store.

Yes… that's what I'm doing. We're using Apollo to fan out the queue, but the writes go back into Cassandra and need to be read out sequentially.

I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought the ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.

Yes. I think using 100 buckets will work for now. Plus I don't have to change the partitioner on our existing cluster, and I'm lazy :)

Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't.

Definitely appreciate the feedback! Thanks!
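The b0 → h0 argument above can be demonstrated directly; md5 here is just a deterministic stand-in for the partitioner's hash (Murmur3 in real Cassandra):

```python
import hashlib

def token(bucket: str) -> int:
    """Stand-in for the partitioner's token computation."""
    return int(hashlib.md5(bucket.encode()).hexdigest(), 16)

# 50 NTP-synced writers all compute the bucket for the same second:
tokens = {token("2014-06-07T21:05:00") for _ in range(50)}

# Every writer derives the same token, so for that interval all 50
# writes land on the same replica set, exactly the bottleneck
# described above; only the *identity* of the hot node changes
# as time advances.
assert len(tokens) == 1
```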
Re: Data model for streaming a large table in real time.
Not if you add another column to the partition key; source, for example. I would really try to stay away from the ordered partitioner if at all possible. What ingestion rates are you expecting, in size and speed?

-- Colin

On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote: Thanks for the feedback on this btw... it's helpful. My notes below. [...]
Re: Data model for streaming a large table in real time.
What's 'source'? You mean like the URL? If source is too random, it's going to yield too many buckets.

Ingestion rates are fairly high but not insane. About 4M inserts per hour.. from 5-10GB…

On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote: Not if you add another column to the partition key; source, for example. [...]
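A quick back-of-the-envelope on the quoted rates, using the midpoint of the 5-10GB/hour range:

```python
inserts_per_hour = 4_000_000
bytes_per_hour = 7.5e9  # midpoint of the 5-10 GB/hour figure above

inserts_per_sec = inserts_per_hour / 3600
avg_row_bytes = bytes_per_hour / inserts_per_hour

# Roughly 1,100 writes/second at ~1.9 KB per row: a modest load
# for a cluster of any real size.
assert round(inserts_per_sec) == 1111
assert 1800 < avg_row_bytes < 1900
```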
Re: Data model for streaming a large table in real time.
With 100 nodes, that ingestion rate is actually quite low, and I don't think you'd need another column in the partition key. You seem to be set in your current direction. Let us know how it works out.

-- Colin

On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote: What's 'source'? You mean like the URL? [...]
Re: Data model for streaming a large table in real time.
Oh.. To start with we're going to use from 2-10 nodes.. I think we're going to take the original strategy and just use 100 buckets.. 0-99… then the timestamp under that.. I think it should be fine and won't require an ordered partitioner. :) Thanks!

On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote: With 100 nodes, that ingestion rate is actually quite low, and I don't think you'd need another column in the partition key. [...]
Re: Data model for streaming a large table in real time.
To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3. Try to have at least 8 cores, 32 gigs of RAM, and separate disks for commit log and data. Will you be replicating data across data centers?

-- Colin

On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote: Oh.. To start with we're going to use from 2-10 nodes.. [...]
Re: Data model for streaming a large table in real time.
Right now I'm just putting everything together as a proof of concept… so just two cheap replicas for now. And it's at 1/1th of the load. If we lose data it's ok :) I think our config will be 2-3x 400GB SSDs in RAID0 , 3 replicas, 16 cores, probably 48-64GB of RAM each box. Just one datacenter for now… We're probably going to be migrating to using linux containers at some point. This way we can have like 16GB , one 400GB SSD, 4 cores for each image. And we can ditch the RAID which is nice. :) On Sat, Jun 7, 2014 at 7:51 PM, Colin colpcl...@gmail.com wrote: To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3. Try to have at least 8 cores, 32 gig ram, and separate disks for log and data. Will you be replicating data across data centers? -- Colin 320-221-9531 On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote: Oh.. To start with we're going to use from 2-10 nodes.. I think we're going to take the original strategy and just to use 100 buckets .. 0-99… then the timestamp under that.. I think it should be fine and won't require an ordered partitioner. :) Thanks! On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote: With 100 nodes, that ingestion rate is actually quite low and I don't think you'd need another column in the partition key. You seem to be set in your current direction. Let us know how it works out. -- Colin 320-221-9531 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote: What's 'source' ? You mean like the URL? If source too random it's going to yield too many buckets. Ingestion rates are fairly high but not insane. About 4M inserts per hour.. from 5-10GB… On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote: Not if you add another column to the partition key; source for example. I would really try to stay away from the ordered partitioner if at all possible. What ingestion rates are you expecting, in size and speed. 
-- Colin 320-221-9531 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote: Thanks for the feedback on this btw.. .it's helpful. My notes below. On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote: No, you're not-the partition key will get distributed across the cluster if you're using random or murmur. Yes… I'm aware. But in practice this is how it will work… If we create bucket b0, that will get hashed to h0… So say I have 50 machines performing writes, they are all on the same time thanks to ntpd, so they all compute b0 for the current bucket based on the time. That gets hashed to h0… If h0 is hosted on node0 … then all writes go to node zero for that 1 second interval. So all my writes are bottlenecking on one node. That node is *changing* over time… but they're not being dispatched in parallel over N nodes. At most writes will only ever reach 1 node a time. You could also ensure that by adding another column, like source to ensure distribution. (Add the seconds to the partition key, not the clustering columns) I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, that it will bite you later. Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines :-P In fact, the use case that you're describing may best be served by a queuing mechanism, and using Cassandra only for the underlying store. Yes… that's what I'm doing. We're using apollo to fan out the queue, but the writes go back into cassandra and needs to be read out sequentially. I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients. Yes. 
I think using 100 buckets will work for now. Plus I don't have to change the partitioner on our existing cluster and I'm lazy :)

Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't.

Definitely appreciate the feedback! Thanks!

-- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
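The hotspot argument in the exchange above can be illustrated with a toy sketch. Assumptions: MD5 here is only a stand-in for Cassandra's Murmur3 token function, and `node_for` is a hypothetical key-to-node mapping, not a real driver API — the point is just that a time-only key sends every concurrent write to one node, while a bucket prefix spreads them.

```python
import hashlib

def node_for(partition_key: str, num_nodes: int = 20) -> int:
    # toy stand-in for Murmur3 token -> replica placement
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

second = 1402200000  # all 50 writers agree on the current second via ntpd

# time-only partition key: every writer derives the same key for this
# second, so every write in the interval lands on the same node
assert len({node_for(str(second)) for _ in range(50)}) == 1

# prefixing a bucket 0-99 spreads the same second's writes over the cluster
nodes_hit = {node_for(f"{bucket:02d}-{second}") for bucket in range(100)}
print(len(nodes_hit))  # most of the 20 nodes now get traffic
```

With 100 buckets and 20 nodes, nearly every node receives writes within each one-second window, which is the distribution the thread is after.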
Re: Data model for streaming a large table in real time.
Write Consistency Level + Read Consistency Level > Replication Factor ensures your reads will be consistent, and having 3 nodes lets you achieve redundancy in the event of node failure. So writing with a CL of LOCAL_QUORUM and reading with a CL of LOCAL_QUORUM (2 + 2 > 3) with a replication factor of 3 ensures consistent reads and protection against losing a node. In the event of losing a node, you can downgrade the CL automatically and then also accept a little eventual consistency. -- Colin 320-221-9531

On Jun 7, 2014, at 10:03 PM, James Campbell ja...@breachintelligence.com wrote: This is a basic question, but having heard that advice before, I'm curious about why the minimum recommended replication factor is three? Certainly additional redundancy, and, I believe, a minimum threshold for Paxos. Are there other reasons?
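The consistency arithmetic Colin describes can be checked in a few lines; `quorum` below is just the standard majority formula, not a driver call:

```python
def quorum(rf: int) -> int:
    # a quorum is a strict majority of the replicas
    return rf // 2 + 1

rf = 3
w = r = quorum(rf)           # write and read both at LOCAL_QUORUM -> 2
assert w + r > rf            # 2 + 2 > 3: every read set overlaps every write set
assert rf - 1 >= quorum(rf)  # with one node down, 2 replicas still form a quorum
print(w, r)  # -> 2 2
```

This is also one answer to James's question: with RF=2, quorum is 2, so losing a single node makes quorum reads and writes fail; RF=3 is the smallest factor where quorum operations survive a node loss.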
Re: Data model for streaming a large table in real time.
You won't need containers - running one instance of Cassandra in that configuration will hum along quite nicely and will make use of the cores and memory. I'd forget the RAID anyway and just mount the disks separately (JBOD). -- Colin 320-221-9531
Re: Data model for streaming a large table in real time.
we're using containers for other reasons, not just cassandra. Tightly constraining resources means we don't have to worry about cassandra, the JVM, or Linux doing something silly and using too many resources and taking down the whole box.
Data model for streaming a large table in real time.
We have the requirement to have clients read from our tables while they're being written. Basically, any write that we make to Cassandra needs to be sent out over the Internet to our customers. We also need them to be able to resume, so if they go offline, they can just pick up where they left off. They need to do this in parallel, so if we have 20 Cassandra nodes, they can have 20 readers, each efficiently (and without coordination) reading from our tables.

Here's how we're planning on doing it. We're going to use the ByteOrderedPartitioner. I'm writing with a primary key of the timestamp; however, in practice, this would yield hotspots. (I'm also aware that time isn't a very good pk in a distributed system, as I can easily have a collision, so we're going to use a scheme similar to a uuid to make it unique per writer.) One node would take all the load, followed by the next node, etc.

So my plan to stop this is to prefix a slice ID to the timestamp. This way each piece of content has a unique ID, but the prefix will place it on a node. The slice ID is just a byte… so this means there are 255 buckets in which I can place data. This means I can have clients each start with a slice and a timestamp, and page through the data with tokens. This way I can have a client reading with 255 threads from 255 regions in the cluster, in parallel, without any hot spots.

Thoughts on this strategy?
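The slice-prefix scheme described above can be sketched as follows. Everything here is illustrative: the `events` table name, the column names, and the `event_key` builder are hypothetical, and the CQL string only shows the per-slice resumable read pattern, not a tested schema.

```python
import time
import uuid

NUM_SLICES = 255  # one prefix byte, per the post

def event_key(slice_id=None):
    # (slice, timestamp, uuid): the slice prefix places the row on a node,
    # and the uuid suffix keeps colliding timestamps unique per writer
    if slice_id is None:
        slice_id = uuid.uuid4().int % NUM_SLICES
    return (slice_id, int(time.time() * 1000), uuid.uuid4())

def resume_query(slice_id: int, last_ts: int) -> str:
    # each client thread owns one slice and resumes past its last-seen timestamp
    return (f"SELECT * FROM events WHERE slice = {slice_id} "
            f"AND ts > {last_ts} ORDER BY ts ASC")

key = event_key()
print(0 <= key[0] < NUM_SLICES)  # True
```

A client that crashes only needs to persist `(slice_id, last_ts)` to pick up where it left off, which is the resumability requirement stated at the top of the post.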