Re: Practical limitations of too many columns/cells ?

2015-08-23 Thread Kevin Burton
Agreed.  We’re going to run a benchmark.  Just realized we grew to 144
columns.  Fun.  Kind of disappointing that Cassandra is so slow in this
regard.  Kind of defeats the whole point of flexible schema if actually
using that feature is slow as hell.

On Sun, Aug 23, 2015 at 4:54 PM, Jeff Jirsa 
wrote:

> The key is to benchmark it with your real data. Modern cassandra-stress
> lets you get very close to your actual read/write behavior, and the real
> differentiator will depend on your use case (how often do you write the
> whole row vs updating just one column/field). My gist shows a ton of
> different examples, but they’re not scientific, and at this point they’re
> old versions (and performance varies version to version).
>
> - Jeff
>
> From:  on behalf of Kevin Burton
> Reply-To: "user@cassandra.apache.org"
> Date: Sunday, August 23, 2015 at 2:58 PM
> To: "user@cassandra.apache.org"
> Subject: Re: Practical limitations of too many columns/cells ?
>
> Ah.. yes.  Great benchmarks. If I’m interpreting them correctly, it was
> ~15x slower for 22 columns vs 2 columns?
>
> Guess we have to refactor again :-P
>
> Not the end of the world of course.
>
> On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa 
> wrote:
>
>> A few months back, a user in #cassandra on freenode mentioned that when
>> they transitioned from thrift to cql, their overall performance decreased
>> significantly. They had 66 columns per table, so I ran some benchmarks with
>> various versions of Cassandra and thrift/cql combinations.
>>
>> It shouldn’t really surprise you that more columns = more work = slower
>> operations. It’s not necessarily the size of the writes, but the amount of
>> work that needs to be done with the extra cells (2 large columns totaling
>> 2k performs better than 66 small columns totaling 0.66k even though it’s
>> three times as much raw data being written to disk)
>>
>> https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c
>>
>> 2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660 bytes per):
>> cassandra-stress --operation INSERT --num-keys 100 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodes
>> Averages from the middle 80% of values: interval_op_rate : 10720
>>
>> 2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200 bytes per):
>> cassandra-stress --operation INSERT --num-keys 100 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes
>> Averages from the middle 80% of values: interval_op_rate : 28667
>>
>> 2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
>> cassandra-stress --operation INSERT --num-keys 100 --columns 2 --column-size=1024 --replication-factor 2 --nodesfile=nodes
>> Averages from the middle 80% of values: interval_op_rate : 23489
>>
>> From:  on behalf of Kevin Burton
>> Reply-To: "user@cassandra.apache.org"
>> Date: Sunday, August 23, 2015 at 1:02 PM
>> To: "user@cassandra.apache.org"
>> Subject: Practical limitations of too many columns/cells ?
>>
>> Is there any advantage to using, say, 40 columns per row vs using 2 columns
>> (one for the pk and the other for data) and then shoving the data into a
>> BLOB as a JSON object?
>>
>> To date, we’ve been just adding new columns.  I profiled Cassandra and
>> about 50% of the CPU time is spent doing compactions.  Seeing that
>> Cassandra is CPU bottlenecked, maybe this is a way I can optimize it.
>>
>> Any thoughts?
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>
>
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Practical limitations of too many columns/cells ?

2015-08-23 Thread Jeff Jirsa
The key is to benchmark it with your real data. Modern cassandra-stress lets
you get very close to your actual read/write behavior, and the real 
differentiator will depend on your use case (how often do you write the whole 
row vs updating just one column/field). My gist shows a ton of different 
examples, but they’re not scientific, and at this point they’re old versions 
(and performance varies version to version). 
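
For instance, a user profile along these lines (a rough sketch; the keyspace,
table, names, and sizes are placeholders, not from this thread; substitute your
real schema and distributions):

# widerow.yaml: placeholder stress profile, swap in your real definitions
keyspace: stresscql
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
table: widerow
table_definition: |
  CREATE TABLE widerow (pk text PRIMARY KEY, c1 text, c2 text);
columnspec:
  - name: pk
    size: fixed(10)     # 10-byte values, mirroring the runs quoted below
  - name: c1
    size: fixed(10)
  - name: c2
    size: fixed(10)
insert:
  partitions: fixed(1)  # write one partition per op
queries:
  readrow:
    cql: SELECT * FROM widerow WHERE pk = ?
    fields: samerow

and then run it with something like:

cassandra-stress user profile=widerow.yaml ops(insert=1,readrow=1) -node node1,node2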

- Jeff

From:   on behalf of Kevin Burton
Reply-To:  "user@cassandra.apache.org"
Date:  Sunday, August 23, 2015 at 2:58 PM
To:  "user@cassandra.apache.org"
Subject:  Re: Practical limitations of too many columns/cells ?

Ah.. yes.  Great benchmarks. If I’m interpreting them correctly, it was ~15x
slower for 22 columns vs 2 columns? 

Guess we have to refactor again :-P

Not the end of the world of course.  

On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa  wrote:
A few months back, a user in #cassandra on freenode mentioned that when they 
transitioned from thrift to cql, their overall performance decreased 
significantly. They had 66 columns per table, so I ran some benchmarks with 
various versions of Cassandra and thrift/cql combinations.

It shouldn’t really surprise you that more columns = more work = slower 
operations. It’s not necessarily the size of the writes, but the amount of work 
that needs to be done with the extra cells (2 large columns totaling 2k 
performs better than 66 small columns totaling 0.66k even though it’s three 
times as much raw data being written to disk)

https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660 bytes 
per):
cassandra-stress --operation INSERT --num-keys 100 --columns 66 
--column-size=10 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate : 10720


2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200 bytes 
per):
cassandra-stress --operation INSERT --num-keys 100 --columns 20 
--column-size=10 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate : 28667


2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
cassandra-stress --operation INSERT --num-keys 100 --columns 2 
--column-size=1024 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate : 23489


From:  on behalf of Kevin Burton
Reply-To: "user@cassandra.apache.org"
Date: Sunday, August 23, 2015 at 1:02 PM
To: "user@cassandra.apache.org"
Subject: Practical limitations of too many columns/cells ?

Is there any advantage to using, say, 40 columns per row vs using 2 columns (one
for the pk and the other for data) and then shoving the data into a BLOB as a
JSON object?

To date, we’ve been just adding new columns.  I profiled Cassandra and about
50% of the CPU time is spent doing compactions.  Seeing that Cassandra is
CPU bottlenecked, maybe this is a way I can optimize it.

Any thoughts?

-- 
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile




-- 
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile






Store JSON as text or UTF-8 encoded blobs?

2015-08-23 Thread Kevin Burton
Hey.

I’m considering migrating my DB from using multiple columns to just 2
columns, with the second one being a JSON object.  Is there going to be any
real difference between TEXT and a UTF-8 encoded BLOB?

I guess it would probably be easier to get tools like Spark to parse the
object as JSON if it’s represented as a BLOB.
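
Concretely, the two options (a sketch; table and key names are made up):

CREATE TABLE posts_text (pk text PRIMARY KEY, data text);
CREATE TABLE posts_blob (pk text PRIMARY KEY, data blob);

-- CQL's built-in conversion functions map between them if we change course later:
SELECT blobAsText(data) FROM posts_blob WHERE pk = 'some-key';
SELECT textAsBlob(data) FROM posts_text WHERE pk = 'some-key';

As far as I know both end up as raw bytes in the SSTable; text just adds UTF-8
validation on the write path.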

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Practical limitations of too many columns/cells ?

2015-08-23 Thread Kevin Burton
Ah.. yes.  Great benchmarks. If I’m interpreting them correctly, it was ~15x
slower for 22 columns vs 2 columns?

Guess we have to refactor again :-P

Not the end of the world of course.

On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa 
wrote:

> A few months back, a user in #cassandra on freenode mentioned that when
> they transitioned from thrift to cql, their overall performance decreased
> significantly. They had 66 columns per table, so I ran some benchmarks with
> various versions of Cassandra and thrift/cql combinations.
>
> It shouldn’t really surprise you that more columns = more work = slower
> operations. It’s not necessarily the size of the writes, but the amount of
> work that needs to be done with the extra cells (2 large columns totaling
> 2k performs better than 66 small columns totaling 0.66k even though it’s
> three times as much raw data being written to disk)
>
> https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c
>
> 2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660 bytes per):
> cassandra-stress --operation INSERT --num-keys 100 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodes
> Averages from the middle 80% of values: interval_op_rate : 10720
>
> 2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200 bytes per):
> cassandra-stress --operation INSERT --num-keys 100 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes
> Averages from the middle 80% of values: interval_op_rate : 28667
>
> 2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
> cassandra-stress --operation INSERT --num-keys 100 --columns 2 --column-size=1024 --replication-factor 2 --nodesfile=nodes
> Averages from the middle 80% of values: interval_op_rate : 23489
>
> From:  on behalf of Kevin Burton
> Reply-To: "user@cassandra.apache.org"
> Date: Sunday, August 23, 2015 at 1:02 PM
> To: "user@cassandra.apache.org"
> Subject: Practical limitations of too many columns/cells ?
>
> Is there any advantage to using, say, 40 columns per row vs using 2 columns
> (one for the pk and the other for data) and then shoving the data into a
> BLOB as a JSON object?
>
> To date, we’ve been just adding new columns.  I profiled Cassandra and
> about 50% of the CPU time is spent doing compactions.  Seeing that
> Cassandra is CPU bottlenecked, maybe this is a way I can optimize it.
>
> Any thoughts?
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Practical limitations of too many columns/cells ?

2015-08-23 Thread Jeff Jirsa
A few months back, a user in #cassandra on freenode mentioned that when they 
transitioned from thrift to cql, their overall performance decreased 
significantly. They had 66 columns per table, so I ran some benchmarks with 
various versions of Cassandra and thrift/cql combinations.

It shouldn’t really surprise you that more columns = more work = slower 
operations. It’s not necessarily the size of the writes, but the amount of work 
that needs to be done with the extra cells (2 large columns totaling 2k 
performs better than 66 small columns totaling 0.66k even though it’s three 
times as much raw data being written to disk)

https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660 bytes 
per):
cassandra-stress --operation INSERT --num-keys 100  --columns 66 
--column-size=10   --replication-factor 2  --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate  : 10720


2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200 bytes 
per):
cassandra-stress --operation INSERT --num-keys 100  --columns 20 
--column-size=10   --replication-factor 2  --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate  : 28667


2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
cassandra-stress --operation INSERT --num-keys 100  --columns 2 
--column-size=1024   --replication-factor 2  --nodesfile=nodes
Averages from the middle 80% of values:
interval_op_rate  : 23489
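
Working the arithmetic on those runs: the 2-column case sustains roughly
2048 bytes x 23489 ops/s, about 48 MB/s of payload, while the 66-column case
manages only 660 x 10720, about 7 MB/s. The cost is per cell, not per byte.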


From:   on behalf of Kevin Burton
Reply-To:  "user@cassandra.apache.org"
Date:  Sunday, August 23, 2015 at 1:02 PM
To:  "user@cassandra.apache.org"
Subject:  Practical limitations of too many columns/cells ?

Is there any advantage to using, say, 40 columns per row vs using 2 columns (one
for the pk and the other for data) and then shoving the data into a BLOB as a
JSON object?

To date, we’ve been just adding new columns.  I profiled Cassandra and about
50% of the CPU time is spent doing compactions.  Seeing that Cassandra is
CPU bottlenecked, maybe this is a way I can optimize it.

Any thoughts?

-- 
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile






Practical limitations of too many columns/cells ?

2015-08-23 Thread Kevin Burton
Is there any advantage to using, say, 40 columns per row vs using 2 columns
(one for the pk and the other for data) and then shoving the data into a
BLOB as a JSON object?

To date, we’ve been just adding new columns.  I profiled Cassandra and
about 50% of the CPU time is spent doing compactions.  Seeing that
Cassandra is CPU bottlenecked, maybe this is a way I can optimize it.
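
Concretely, the two layouts (made-up names, just to illustrate what I mean):

CREATE TABLE items_wide (
  pk    text PRIMARY KEY,
  title text,
  body  text,
  score int
  -- ...one CQL column per field, ~40 in our case
);

CREATE TABLE items_packed (
  pk   text PRIMARY KEY,
  data blob  -- the whole object serialized as JSON
);

The packed form writes one cell per row instead of ~40, at the cost of a
read-modify-write whenever we update a single field.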

Any thoughts?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile