On Sat, Jun 7, 2014 at 1:34 PM, Colin <colpcl...@gmail.com> wrote:

> Maybe it makes sense to describe what you're trying to accomplish in more
> detail.

Essentially, I'm appending writes of recent data from our crawler and
sending that data to our customers.
They need to sync to up-to-date writes; we need to get them writes within
seconds.

> A common bucketing approach is along the lines of year, month, day, hour,
> minute, etc and then use a timeuuid as a cluster column.

That is acceptable, but it means that for a given one-minute interval, all
writes go to one node (and its replicas), so total cluster throughput is
bottlenecked on the max disk throughput of a single node. The same goes
for reads: unless our customers are lagged, they will all stampede and
read that one-minute window from the same node. That's no fun; it would
easily DoS our cluster.

> Depending upon the semantics of the transport protocol you plan on
> utilizing, either the client code keep track of pagination, or the app
> server could, if you utilized some type of request/reply/ack flow. You
> could keep sequence numbers for each client, and begin streaming data to
> them or allowing query upon reconnect, etc.
>
> But again, more details of the use case might prove useful.

I think if we were to just use 100 buckets it would probably work fine.
We're probably not going to have more than 100 nodes in the next year,
and even if we do, that's still reasonable performance. If each box has a
400GB SSD, that's 40TB of VERY fast data.

Kevin

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
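To make the bucketing idea concrete, here is a minimal sketch of how the
100-bucket split could look on the client side. All names here
(NUM_BUCKETS, make_partition_key, partitions_for_minute) are hypothetical,
not from an actual schema; the assumption is a partition key of
(bucket, minute) with a timeuuid clustering column, where bucket is
derived from the event id:

```python
import uuid
import datetime

NUM_BUCKETS = 100  # assumed; roughly matches expected cluster size


def make_partition_key(event_id: uuid.UUID, ts: datetime.datetime):
    """Spread one minute's writes across NUM_BUCKETS partitions.

    The bucket is derived from the event id, so each minute's writes
    land on up to NUM_BUCKETS distinct partitions (and therefore up to
    that many nodes) instead of hammering a single one.
    """
    bucket = event_id.int % NUM_BUCKETS
    minute = ts.replace(second=0, microsecond=0)
    return (bucket, minute)


def partitions_for_minute(minute: datetime.datetime):
    """Readers fan out over every bucket to reassemble one minute."""
    return [(b, minute) for b in range(NUM_BUCKETS)]
```

The trade-off is the usual one: writes for a minute no longer hotspot on
one replica set, but a reader must issue NUM_BUCKETS queries (one per
bucket) to collect that minute's data.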