Re: Tuning Replication Factor - All, Consistency ONE

2018-07-11 Thread Jürgen Albersdorfer
And by all means, do not treat Cassandra as a relational database - beware
of the limitations of CQL in contrast to SQL.
I don't want to argue against Cassandra, because I like it for what it was
primarily designed for - horizontal scalability for HUGE amounts of data.
It is good for accessing your data by key, but not for searching. High
availability is a nice bonus here.
If you end up having only one table in C*, maybe something like Redis would
work for your needs, too.

Some hints from my own experience with it, if you choose to use Cassandra:
Have at least as many racks as your replication factor - a replication
factor of 3 means you want to have at least 3 racks.
Choose your partitioning wisely - it starts becoming relevant from about
10 million records onwards.
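
As a rough illustration of the partitioning point (the keyspace, table and
column names below are made up - your schema will differ): a per-user lookup
table keeps every partition small and bounded, and the authentication read
then touches exactly one partition.

    CREATE TABLE auth.users_by_login (
        login    text,     -- partition key: one small partition per user
        user_id  uuid,
        pw_hash  text,
        enabled  boolean,
        PRIMARY KEY (login)
    );

    -- the authentication read is a single-partition, key-based lookup
    SELECT user_id, pw_hash, enabled
    FROM auth.users_by_login
    WHERE login = 'alice';

Anything that has to scan across partitions (searching by attribute, ranges
over all users) is where CQL starts to hurt compared to SQL.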

regards,
Jürgen

On Tue, Jul 10, 2018 at 18:18, Jeff Jirsa wrote:

>
>
> On Tue, Jul 10, 2018 at 8:29 AM, Code Wiget  wrote:
>
>> Hi,
>>
>> I have been tasked with picking and setting up a database with the
>> following characteristics:
>>
>>- Ultra-high availability - The real requirement is uptime - our
>>whole platform becomes inaccessible without a “read” from the database. We
>>need the read to authenticate users. Databases will never be spread across
>>multiple networks.
>>
>>
> Sooner or later life will happen and you're going to have some
> unavailability - may be worth taking the time to make it fail gracefully
> (cache auth responses, etc).
>
>>
>>- Reasonably quick access speeds
>>- Very low data storage - The data storage is very low - for 10
>>million users, we would have around 8GB of storage total.
>>
>> Having done a bit of research on Cassandra, I think the optimal approach
>> for my use-case would be to replicate the data on *ALL* nodes possible,
>> but require reads to only have a consistency level of one. So, in the case
>> that a node goes down, we can still read/write to other nodes. It is not
>> very important that a read be unanimously agreed upon; as long as Cassandra
>> is eventually consistent within around 1s, there shouldn’t be an issue.
>>
>
> Seems like a reasonably good fit, but there's no 1s guarantee - it'll
> USUALLY happen within milliseconds, but the edge cases don't have a strict
> guarantee at all (imagine two hosts in adjacent racks, the link between the
> two racks goes down, but both are otherwise functional - a query at ONE in
> either rack would be able to read and write data, but it would diverge
> between the two racks for some period of time).
>
>
>>
>> When I go to set up the database though, I am required to set a
>> replication factor to a number - 1,2,3,etc. So I can’t just say “ALL” and
>> have it replicate to all nodes.
>>
>
> That option doesn't exist. It's been proposed (and exists in Datastax
> Enterprise, which is a proprietary fork), but reportedly causes quite a bit
> of pain when misused, so people have successfully lobbied against its
> inclusion in OSS Apache Cassandra. You could (assuming some basic Java
> knowledge) extend NetworkTopologyStrategy to have it accomplish this, but I
> imagine you don't REALLY want this unless you're frequently auto-scaling
> nodes in/out of the cluster. You should probably just pick a high RF and
> you'll be OK with it.
>
>
>> Right now, I have a 2 node cluster with replication factor 3. Will this
>> cause any issues, having a RF > #nodes? Or is there a way to just have it
>> copy to *all* nodes?
>>
>
> It's obviously not the intended config, but I don't think it'll cause many
> problems.
>
>
>> Is there any way that I can tune Cassandra to be more read-optimized?
>>
>>
> Yes - definitely use leveled compaction instead of STCS (the default), and
> definitely take the time to tune the JVM args - the read path generates a lot
> of short-lived Java objects, so a larger eden will help you (maybe up to
> 40-50% of max heap size).
>
>
>> Finally, I have some misgivings about how well Cassandra fits my
>> use-case. Please, if anyone has a suggestion as to why or why not it is a
>> good fit, I would really appreciate your input! If this could be done with
>> a simple SQL database and this is overkill, please let me know.
>>
>> Thanks for your input!
>>
>>
>


Re: Tuning Replication Factor - All, Consistency ONE

2018-07-10 Thread Jeff Jirsa
On Tue, Jul 10, 2018 at 8:29 AM, Code Wiget  wrote:

> Hi,
>
> I have been tasked with picking and setting up a database with the
> following characteristics:
>
>- Ultra-high availability - The real requirement is uptime - our whole
>platform becomes inaccessible without a “read” from the database. We need
>the read to authenticate users. Databases will never be spread across
>multiple networks.
>
>
Sooner or later life will happen and you're going to have some
unavailability - may be worth taking the time to make it fail gracefully
(cache auth responses, etc).

>
>- Reasonably quick access speeds
>- Very low data storage - The data storage is very low - for 10
>million users, we would have around 8GB of storage total.
>
> Having done a bit of research on Cassandra, I think the optimal approach
> for my use-case would be to replicate the data on *ALL* nodes possible,
> but require reads to only have a consistency level of one. So, in the case
> that a node goes down, we can still read/write to other nodes. It is not
> very important that a read be unanimously agreed upon; as long as Cassandra
> is eventually consistent within around 1s, there shouldn’t be an issue.
>

Seems like a reasonably good fit, but there's no 1s guarantee - it'll
USUALLY happen within milliseconds, but the edge cases don't have a strict
guarantee at all (imagine two hosts in adjacent racks, the link between the
two racks goes down, but both are otherwise functional - a query at ONE in
either rack would be able to read and write data, but it would diverge
between the two racks for some period of time).
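
For what it's worth, the consistency level is chosen per request rather than
in the schema, so the read side of that plan looks roughly like this (the
keyspace, table and column names are placeholders):

    -- in cqlsh; the drivers expose the same setting per statement
    CONSISTENCY ONE;
    SELECT user_id, pw_hash
    FROM auth.users_by_login
    WHERE login = 'alice';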


>
> When I go to set up the database though, I am required to set a
> replication factor to a number - 1,2,3,etc. So I can’t just say “ALL” and
> have it replicate to all nodes.
>

That option doesn't exist. It's been proposed (and exists in Datastax
Enterprise, which is a proprietary fork), but reportedly causes quite a bit
of pain when misused, so people have successfully lobbied against its
inclusion in OSS Apache Cassandra. You could (assuming some basic Java
knowledge) extend NetworkTopologyStrategy to have it accomplish this, but I
imagine you don't REALLY want this unless you're frequently auto-scaling
nodes in/out of the cluster. You should probably just pick a high RF and
you'll be OK with it.
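
If you do go the "just pick a high RF" route, that's a one-line keyspace
setting (the keyspace and datacenter names below are placeholders), and you
can raise it later on an existing keyspace as long as you repair afterwards:

    CREATE KEYSPACE auth
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

    -- raising RF later: change the setting, then repair so the new
    -- replicas are streamed the existing data
    ALTER KEYSPACE auth
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};
    -- then run a full repair of the keyspace on each node,
    -- e.g. nodetool repair -full auth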


> Right now, I have a 2 node cluster with replication factor 3. Will this
> cause any issues, having a RF > #nodes? Or is there a way to just have it
> copy to *all* nodes?
>

It's obviously not the intended config, but I don't think it'll cause many
problems.


> Is there any way that I can tune Cassandra to be more read-optimized?
>
>
Yes - definitely use leveled compaction instead of STCS (the default), and
definitely take the time to tune the JVM args - the read path generates a lot
of short-lived Java objects, so a larger eden will help you (maybe up to
40-50% of max heap size).
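
Concretely, the compaction change is a per-table setting (the keyspace and
table names below are placeholders):

    ALTER TABLE auth.users_by_login
      WITH compaction = {'class': 'LeveledCompactionStrategy'};

The eden sizing is done with the usual JVM flags rather than in CQL - e.g.
HEAP_NEWSIZE in cassandra-env.sh if you're on the default CMS settings.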


> Finally, I have some misgivings about how well Cassandra fits my use-case.
> Please, if anyone has a suggestion as to why or why not it is a good fit, I
> would really appreciate your input! If this could be done with a simple SQL
> database and this is overkill, please let me know.
>
> Thanks for your input!
>
>


Tuning Replication Factor - All, Consistency ONE

2018-07-10 Thread Code Wiget
Hi,

I have been tasked with picking and setting up a database with the following 
characteristics:

• Ultra-high availability - The real requirement is uptime - our whole platform 
becomes inaccessible without a “read” from the database. We need the read to 
authenticate users. Databases will never be spread across multiple networks.
• Reasonably quick access speeds
• Very low data storage - The data storage is very low - for 10 million users, 
we would have around 8GB of storage total.

Having done a bit of research on Cassandra, I think the optimal approach for my 
use-case would be to replicate the data on ALL nodes possible, but require 
reads to only have a consistency level of one. So, in the case that a node goes 
down, we can still read/write to other nodes. It is not very important that a
read be unanimously agreed upon; as long as Cassandra is eventually consistent
within around 1s, there shouldn’t be an issue.

When I go to set up the database though, I am required to set a replication 
factor to a number - 1,2,3,etc. So I can’t just say “ALL” and have it replicate 
to all nodes. Right now, I have a 2 node cluster with replication factor 3. 
Will this cause any issues, having a RF > #nodes? Or is there a way to just 
have it copy to all nodes? Is there any way that I can tune Cassandra to be 
more read-optimized?

Finally, I have some misgivings about how well Cassandra fits my use-case. 
Please, if anyone has a suggestion as to why or why not it is a good fit, I 
would really appreciate your input! If this could be done with a simple SQL 
database and this is overkill, please let me know.

Thanks for your input!