If Write Consistency Level + Read Consistency Level > Replication Factor,
your reads are guaranteed to see the latest acknowledged write, and having
3 nodes gives you redundancy in the event of a node failure.

So writing with a CL of local quorum and reading with a CL of local quorum
(2+2>3) against a replication factor of 3 gives you consistent reads and
protection against losing a node.

In the event of losing a node, you can downgrade the CL automatically and
accept a little eventual consistency.
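
With a driver like the DataStax Python driver, the pattern looks roughly
like this (just a sketch; the keyspace, table, and column names are made
up, and the automatic CL downgrade is what the driver's
DowngradingConsistencyRetryPolicy does):

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DowngradingConsistencyRetryPolicy
    from cassandra.query import SimpleStatement

    # Assumes a keyspace created with replication factor 3; the names here
    # are placeholders, not anything from this thread.
    cluster = Cluster(
        ['127.0.0.1'],
        # Retries at a lower CL when not enough replicas respond, i.e. the
        # "downgrade automatically / accept eventual consistency" option.
        default_retry_policy=DowngradingConsistencyRetryPolicy())
    session = cluster.connect('my_keyspace')

    # Write at LOCAL_QUORUM: 2 of the 3 replicas must acknowledge.
    insert = SimpleStatement(
        "INSERT INTO events (bucket, ts, payload) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(insert, (42, uuid.uuid1(), 'hello'))

    # Read at LOCAL_QUORUM: 2 of the 3 replicas must answer, so the read
    # set overlaps the write set (2 + 2 > 3) and includes the latest write.
    select = SimpleStatement(
        "SELECT payload FROM events WHERE bucket = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    for row in session.execute(select, (42,)):
        print(row.payload)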


--
Colin
320-221-9531


On Jun 7, 2014, at 10:03 PM, James Campbell <ja...@breachintelligence.com>
wrote:

 This is a basic question, but having heard that advice before, I'm curious
why the minimum recommended replication factor is three. Certainly it adds
redundancy, and, I believe, it's a minimum threshold for Paxos. Are there
other reasons?
On Jun 7, 2014 10:52 PM, Colin <colpcl...@gmail.com> wrote:
 To have any redundancy in the system, start with at least 3 nodes and a
replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
data.

 Will you be replicating data across data centers?

-- 
Colin
320-221-9531


On Jun 7, 2014, at 9:40 PM, Kevin Burton <bur...@spinn3r.com> wrote:

  Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just use 100
buckets (0-99), then the timestamp under that.  I think it should be fine
and won't require an ordered partitioner. :)
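
Roughly, that schema would look something like this (a sketch only; the
table and column names are placeholders):

    import uuid

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_keyspace')  # keyspace name is a placeholder

    # bucket (0-99) is the partition key, so writes spread over 100
    # partitions under the default Murmur3 partitioner; the timeuuid
    # clustering column keeps rows inside each bucket in time order, with
    # no ordered partitioner needed.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events (
            bucket int,
            ts timeuuid,
            payload text,
            PRIMARY KEY (bucket, ts)
        )
    """)

    def write_event(seq, payload):
        # Spread writers over the 100 buckets (round-robin here; a hash
        # of the item works too).
        session.execute(
            "INSERT INTO events (bucket, ts, payload) VALUES (%s, %s, %s)",
            (seq % 100, uuid.uuid1(), payload))

    def read_bucket(bucket):
        # Rows come back in time order within a bucket; a reader merges
        # the 100 per-bucket streams to get one sequential view.
        return session.execute(
            "SELECT ts, payload FROM events WHERE bucket = %s", (bucket,))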

 Thanks!


On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws> wrote:

>  With 100 nodes, that ingestion rate is actually quite low and I don't
> think you'd need another column in the partition key.
>
>  You seem to be set in your current direction.  Let us know how it works
> out.
>
> --
> Colin
> 320-221-9531
>
>
> On Jun 7, 2014, at 9:18 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>   What's 'source'? You mean like the URL?
>
>  If source is too random it's going to yield too many buckets.
>
>  Ingestion rates are fairly high but not insane.  About 4M inserts per
> hour.. from 5-10GB…
>
>
> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:
>
>>  Not if you add another column to the partition key; source for example.
>>
>>
>>  I would really try to stay away from the ordered partitioner if at all
>> possible.
>>
>>  What ingestion rates are you expecting, in size and speed?
>>
>> --
>> Colin
>> 320-221-9531
>>
>>
>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>
>>  Thanks for the feedback on this btw.. .it's helpful.  My notes below.
>>
>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>>
>>>  No, you're not; the partition key will get distributed across the
>>> cluster if you're using random or murmur.
>>>
>>
>>  Yes… I'm aware.  But in practice this is how it will work…
>>
>>  If we create bucket b0, that will get hashed to h0…
>>
>>  So say I have 50 machines performing writes; their clocks are all in
>> sync thanks to ntpd, so they all compute b0 for the current bucket based
>> on the time.
>>
>>  That gets hashed to h0…
>>
>>  If h0 is hosted on node0 … then all writes go to node zero for that 1
>> second interval.
>>
>>  So all my writes are bottlenecking on one node.  That node is
>> *changing* over time… but they're not being dispatched in parallel over N
>> nodes.  At most, writes will only ever reach one node at a time.
>>
>>
>>
>>>  You could also ensure distribution by adding another column, like
>>> source, to the partition key. (Add the seconds to the partition key,
>>> not the clustering columns.)
>>>
>>>  I can almost guarantee that if you put too much thought into working
>>> against what Cassandra offers out of the box, it will bite you later.
>>>
>>>
>>  Sure.. I'm trying to avoid the 'bite you later' issues. More so because
>> I'm sure there are Cassandra gotchas to worry about.  Everything has them.
>>  Just trying to avoid the land mines :-P
>>
>>
>>>  In fact, the use case that you're describing may best be served by a
>>> queuing mechanism, and using Cassandra only for the underlying store.
>>>
>>
>>  Yes… that's what I'm doing.  We're using Apollo to fan out the queue,
>> but the writes go back into Cassandra and need to be read out sequentially.
>>
>>
>>>
>>>  I used this exact same approach in a use case that involved writing
>>> over a million events/second to a cluster with no problems.  Initially, I
>>> thought ordered partitioner was the way to go too.  And I used separate
>>> processes to aggregate, conflate, and handle distribution to clients.
>>>
>>
>>
>>  Yes. I think using 100 buckets will work for now.  Plus I don't have to
>> change the partitioner on our existing cluster and I'm lazy :)
>>
>>
>>>
>>>  Just my two cents, but I also spend the majority of my days helping
>>> people utilize Cassandra correctly, and rescuing those that haven't.
>>>
>>>
>>  Definitely appreciate the feedback!  Thanks!
>>
>>  --
>>
>>  Founder/CEO Spinn3r.com
>>  Location: *San Francisco, CA*
>> Skype: *burtonator*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>  <http://spinn3r.com>
>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
>> people.
>>
>>
>
>
>  --
>
>  Founder/CEO Spinn3r.com
>  Location: *San Francisco, CA*
> Skype: *burtonator*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>  <http://spinn3r.com>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
> people.
>
>


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
 <http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
