No, you're not: the partition key will get distributed across the cluster
if you're using the RandomPartitioner or Murmur3Partitioner.  You could
also improve distribution by adding another column, like source, to the
partition key.  (Add the seconds to the partition key, not the clustering
columns.)
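
For example, a rough sketch of that shape (hypothetical table and column
names):

  -- hypothetical schema: the bucket (seconds, or hash(source) % N) lives
  -- in the partition key, so Murmur3Partitioner spreads writes across
  -- the cluster instead of hotspotting one node per time window
  CREATE TABLE events (
      bucket   int,
      event_id timeuuid,   -- time-ordered within each partition
      payload  blob,
      PRIMARY KEY ((bucket), event_id)
  );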

I can almost guarantee that if you put too much effort into working
against what Cassandra offers out of the box, it will bite you later.

In fact, the use case you're describing may be best served by a queuing
mechanism, with Cassandra used only as the underlying store.

I used this exact approach in a use case that involved writing over a
million events/second to a cluster with no problems.  Initially, I thought
the ordered partitioner was the way to go too.  I used separate processes
to aggregate, conflate, and handle distribution to clients.

Just my two cents, but I also spend the majority of my days helping people
use Cassandra correctly, and rescuing those who haven't.

:)

--
Colin
320-221-9531


On Jun 7, 2014, at 6:53 PM, Kevin Burton <bur...@spinn3r.com> wrote:

Well, you could add milliseconds, but at best you're still bottlenecking
most of your writes on one box.. maybe 2-3 if some writes are lagging.

Anyway.. I think using 100 buckets is probably fine..
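
I.e., one of these per bucket, merged client-side (hypothetical names,
reusing the events sketch above):

  -- writers spread load with e.g. bucket = hash(event_id) % 100;
  -- readers issue one query per bucket (0..99) and merge the results
  SELECT event_id, payload
  FROM events
  WHERE bucket = 0
    AND event_id > minTimeuuid('2014-06-07 18:00+0000');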

Kevin


On Sat, Jun 7, 2014 at 2:45 PM, Colin <colpcl...@gmail.com> wrote:

> Then add seconds to the bucket.  Also, the data will get cached; it's not
> going to hit disk on every read.
>
> Look at the key cache settings on the table.  Also, in 2.1 you have even
> more control over caching.
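>
> For example, with the 2.1-style map syntax (hypothetical table name):
>
>   -- per-table caching knobs: 'keys' drives the key cache,
>   -- 'rows_per_partition' the row cache
>   ALTER TABLE events
>     WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};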
>
>
> On Jun 7, 2014, at 4:30 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>
> On Sat, Jun 7, 2014 at 1:34 PM, Colin <colpcl...@gmail.com> wrote:
>
>> Maybe it makes sense to describe what you're trying to accomplish in more
>> detail.
>>
>>
> Essentially, I'm appending recent writes from our crawler and sending
> that data to our customers.
>
> They need to stay in sync with up-to-date writes… we need to get them the
> writes within seconds.
>
>> A common bucketing approach is along the lines of year, month, day, hour,
>> minute, etc., and then using a timeuuid as a clustering column.
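>>
>> For instance (hypothetical names):
>>
>>   -- one partition per minute; the timeuuid clustering column keeps
>>   -- events time-ordered within that minute
>>   CREATE TABLE events_by_minute (
>>       day     text,      -- e.g. '2014-06-07'
>>       hour    int,
>>       minute  int,
>>       id      timeuuid,
>>       payload blob,
>>       PRIMARY KEY ((day, hour, minute), id)
>>   );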
>>
>>
> I mean, that is acceptable.. but it means that for that 1-minute interval,
> all writes are going to that one node (and its replicas).
>
> So the total cluster write throughput is bottlenecked on a single node's
> max disk throughput.
>
> Same thing for reads… unless our customers are lagged, they are all going
> to stampede, and ALL of them are going to read data from one node within a
> one-minute timeframe.
>
> That's no fun..  that will easily DoS our cluster.
>
>
>> Depending upon the semantics of the transport protocol you plan on
>> utilizing, either the client code could keep track of pagination, or the
>> app server could, if you utilized some type of request/reply/ack flow.
>> You could keep sequence numbers for each client, and begin streaming data
>> to them or allow querying upon reconnect, etc.
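>>
>> For example, a per-client cursor (hypothetical names):
>>
>>   -- resume point per client: on reconnect, query events with
>>   -- id > last_seen and stream forward from there
>>   CREATE TABLE client_cursors (
>>       client_id text PRIMARY KEY,
>>       last_seen timeuuid
>>   );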
>>
>> But again, more details of the use case might prove useful.
>>
>>
> I think if we were to just use 100 buckets it would probably work fine.
> We're probably not going to be more than 100 nodes in the next year, and if
> we are, that's still reasonable performance.
>
> I mean, if each box has a 400GB SSD, that's 40TB of VERY fast data across
> 100 nodes.
>
> Kevin
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
