Another way around this is to have a separate table storing the number of
buckets.

This way, if it turns out you have too few buckets, you can increase the
count in the future.
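
For example (just a sketch; the table and column names here are placeholders
I'm making up):

create table bucket_config (
    table_name  text primary key,
    num_buckets int
);

insert into bucket_config (table_name, num_buckets) values ('foo', 65536);

-- readers and writers load this at startup and use it when computing the
-- bucket, e.g. bucket = hash(row key) % num_buckets
select num_buckets from bucket_config where table_name = 'foo';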

Of course, the older data will still have too few buckets :-(


On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton <bur...@spinn3r.com> wrote:

>
> On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark <co...@clark.ws> wrote:
>
>> It's an anti-pattern and there are better ways to do this.
>>
>>
> Entirely possible :)
>
> It would be nice to have a document with a bunch of common Cassandra
> design patterns.
>
> I've been trying to track down a pattern for this, and a lot of it is
> pieced together across different places and individual blog posts, so one
> has to reverse engineer it.
>
>
>> I have implemented the paging algorithm you've described using wide rows
>> and bucketing.  This approach is a more efficient utilization of
>> Cassandra's built-in wholesome goodness.
>>
>
> So... I assume the general pattern is:
>
> You create buckets... something like 2^16 of them, and the bucket is your
> partition key.
>
> Then you place a timestamp next to the bucket as the clustering column in
> the primary key.
>
> So essentially:
>
> primary key( bucket, timestamp )…
>
> ... so to read from this bucket you essentially execute:
>
> select * from foo where bucket = 100 and timestamp > 12345790 limit 10000;
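>
> So the table itself would be something like this (just a sketch; 'foo' and
> the key columns follow the query above, the data column and its types are
> my own placeholders):
>
> create table foo (
>     bucket    int,
>     timestamp bigint,
>     data      text,
>     primary key (bucket, timestamp)
> );
>
> The bucket is the partition key and timestamp is the clustering column, so
> the range query above stays within a single partition and comes back
> ordered by timestamp.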
>
>
>>
>> Also, I wouldn't let any number of clients (it could be huge) connect
>> directly to the cluster to do this; put some type of app server in between
>> to handle the comms and fan-out.  You'll get better utilization of
>> resources and less overhead, in addition to flexibility over which data
>> center you're using to serve requests.
>>
>>
> this is interesting… since the partition key is the bucket, you could make
> some poor decisions about the number of buckets.
>
> For example,
>
> if you use 2^64 buckets, the number of items in each bucket is going to be
> rather small.  So you're going to have tons of queries each fetching 0-1
> row (if you have a small amount of data).
>
> But if you use very FEW buckets... say 5, but you have a cluster of 1000
> nodes, then those 5 buckets will live on just 5 nodes, and the rest of the
> nodes won't have any data.
>
> Hm..
>
> the byte ordered partitioner solves this problem because I can just pick a
> fixed number of buckets as the primary key prefix, and the data in a bucket
> can then be split up across machines at any arbitrary point, even in the
> middle of a 'bucket' …
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
