To have any redundancy in the system, start with at least 3 nodes and a 
replication factor of 3.
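
For illustration, creating a keyspace with that replication factor might look
like this with the DataStax Python driver (a sketch only; the host and
keyspace names are made up):

    # Minimal sketch: connect to a hypothetical 3-node cluster and create
    # a keyspace with replication factor 3.
    from cassandra.cluster import Cluster

    cluster = Cluster(['node1', 'node2', 'node3'])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS events
        WITH replication = {'class': 'SimpleStrategy',
                            'replication_factor': 3}
    """)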

Try to have at least 8 cores, 32 GB of RAM, and separate disks for the commit
log and data.

Will you be replicating data across data centers?

--
Colin
320-221-9531


> On Jun 7, 2014, at 9:40 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> 
> Oh… to start with, we're going to use 2 to 10 nodes.
> 
> I think we're going to take the original strategy and just use 100 buckets,
> 0-99, with the timestamp under that.  I think it should be fine and won't
> require an ordered partitioner. :)
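> 
> As a sketch of that bucketing scheme (the hash-based bucket choice and the
> names below are assumptions based on the description above, not a spec):
> 
>     # Hypothetical partition-key computation for the 100-bucket scheme:
>     # spread rows over buckets 0-99, and keep the event timestamp for
>     # ordering within the bucket.
>     import time
>     import zlib
> 
>     NUM_BUCKETS = 100
> 
>     def make_key(natural_key: str):
>         bucket = zlib.crc32(natural_key.encode()) % NUM_BUCKETS  # 0-99
>         ts = int(time.time() * 1000)  # millisecond timestamp
>         return bucket, ts
> 
>     print(make_key('http://example.com/post/1'))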
> 
> Thanks!
> 
> 
>> On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws> wrote:
>> With 100 nodes, that ingestion rate (about 4M inserts/hour is roughly
>> 1,100 writes/second) is actually quite low, and I don't think you'd need
>> another column in the partition key.
>> 
>> You seem to be set in your current direction.  Let us know how it works out.
>> 
>> --
>> Colin
>> 320-221-9531
>> 
>> 
>>> On Jun 7, 2014, at 9:18 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>> 
>>> What's 'source'?  You mean like the URL?
>>> 
>>> If source is too random, it's going to yield too many buckets.
>>> 
>>> Ingestion rates are fairly high but not insane: about 4M inserts per
>>> hour, from 5-10GB…
>>> 
>>> 
>>>> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:
>>>> Not if you add another column to the partition key; 'source', for example.
>>>> 
>>>> I would really try to stay away from the ordered partitioner if at all 
>>>> possible.
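>>>> 
>>>> A sketch of that composite partition key (table and column names are
>>>> hypothetical; the point is 'source' joining the time bucket in the
>>>> partition key):
>>>> 
>>>>     from cassandra.cluster import Cluster
>>>> 
>>>>     session = Cluster(['node1']).connect()
>>>>     session.execute("""
>>>>         CREATE TABLE IF NOT EXISTS events.raw (
>>>>             source  text,      -- second partition-key column
>>>>             bucket  int,       -- time bucket (e.g. epoch seconds)
>>>>             ts      timeuuid,  -- clustering column for ordering
>>>>             payload blob,
>>>>             PRIMARY KEY ((source, bucket), ts)
>>>>         )
>>>>     """)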
>>>> 
>>>> What ingestion rates are you expecting, in size and speed?
>>>> 
>>>> --
>>>> Colin
>>>> 320-221-9531
>>>> 
>>>> 
>>>>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>>>> 
>>>>> 
>>>>> Thanks for the feedback on this, btw… it's helpful.  My notes below.
>>>>> 
>>>>>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>>>>>> No, you're not; the partition key will get distributed across the cluster
>>>>>> if you're using the random or Murmur3 partitioner.
>>>>> 
>>>>> Yes… I'm aware.  But in practice this is how it will work…
>>>>> 
>>>>> If we create bucket b0, that will get hashed to h0…
>>>>> 
>>>>> So say I have 50 machines performing writes.  Their clocks are all
>>>>> synchronized via ntpd, so they all compute b0 as the current bucket based
>>>>> on the time.
>>>>> 
>>>>> That gets hashed to h0…
>>>>> 
>>>>> If h0 is hosted on node0, then all writes go to node0 for that one-second
>>>>> interval.
>>>>> 
>>>>> So all my writes are bottlenecking on one node.  That node is *changing*
>>>>> over time… but they're not being dispatched in parallel over N nodes.  At
>>>>> most, writes will only ever reach one node at a time.
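>>>>> 
>>>>> A toy illustration of that bottleneck (pure Python, no Cassandra; the
>>>>> per-second bucket function is the assumption being criticized here):
>>>>> 
>>>>>     # 50 writers all derive the partition key from the same clock, so
>>>>>     # at any instant they produce one identical key: one hot partition,
>>>>>     # hence one replica set taking every write for that second.
>>>>>     def bucket_for(ts: float) -> int:
>>>>>         return int(ts)  # one bucket per second
>>>>> 
>>>>>     now = 1402190400.0  # some fixed instant
>>>>>     keys = {bucket_for(now) for _ in range(50)}
>>>>>     print(keys)  # {1402190400}: a single partition for that second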
>>>>> 
>>>>>  
>>>>>> You could also ensure distribution by adding another column, like
>>>>>> source, to the partition key.  (Add the seconds to the partition key,
>>>>>> not to the clustering columns.)
>>>>>> 
>>>>>> I can almost guarantee that if you put too much thought into working
>>>>>> against what Cassandra offers out of the box, it will bite you
>>>>>> later.
>>>>> 
>>>>> Sure… I'm trying to avoid the 'bite you later' issues, more so because
>>>>> I'm sure there are Cassandra gotchas to worry about.  Everything has
>>>>> them.  Just trying to avoid the land mines. :-P
>>>>>  
>>>>>> In fact, the use case that you're describing may best be served by a 
>>>>>> queuing mechanism, and using Cassandra only for the underlying store.
>>>>> 
>>>>> Yes… that's what I'm doing.  We're using Apollo to fan out the queue, but
>>>>> the writes go back into Cassandra and need to be read out sequentially.
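>>>>> 
>>>>> A hedged sketch of that sequential read-back, assuming the composite-key
>>>>> table from the earlier example (all names hypothetical):
>>>>> 
>>>>>     from cassandra.cluster import Cluster
>>>>> 
>>>>>     session = Cluster(['node1']).connect()
>>>>>     # Fetch one (source, bucket) partition; rows come back in
>>>>>     # clustering (ts) order.  Merging across buckets happens
>>>>>     # client-side.
>>>>>     rows = session.execute(
>>>>>         "SELECT ts, payload FROM events.raw "
>>>>>         "WHERE source = %s AND bucket = %s",
>>>>>         ('feed-1', 1402190400))
>>>>>     for row in rows:
>>>>>         print(row.ts)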
>>>>>  
>>>>>> 
>>>>>> I used this exact same approach in a use case that involved writing over 
>>>>>> a million events/second to a cluster with no problems.  Initially, I 
>>>>>> thought ordered partitioner was the way to go too.  And I used separate 
>>>>>> processes to aggregate, conflate, and handle distribution to clients.
>>>>> 
>>>>> 
>>>>> Yes. I think using 100 buckets will work for now.  Plus I don't have to 
>>>>> change the partitioner on our existing cluster and I'm lazy :)
>>>>>  
>>>>>> 
>>>>>> Just my two cents, but I also spend the majority of my days helping
>>>>>> people utilize Cassandra correctly, and rescuing those who haven't.
>>>>> 
>>>>> Definitely appreciate the feedback!  Thanks!
>>>>>  
>>>>> -- 
>>>>> Founder/CEO Spinn3r.com
>>>>> Location: San Francisco, CA
>>>>> Skype: burtonator
>>>>> blog: http://burtonator.wordpress.com
>>>>> … or check out my Google+ profile
>>>>> 
>>>>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
>>>>> people.
>>> 
>>> 
>>> 
>>> -- 
>>> Founder/CEO Spinn3r.com
>>> Location: San Francisco, CA
>>> Skype: burtonator
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> 
>>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
>>> people.
> 
> 
> 
> -- 
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> Skype: burtonator
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
> people.