Problem with very many small SSTables

2014-12-15 Thread Mathijs Vogelzang
Hi,

We have a 6-node cassandra cluster that got into an unstable state because
a few servers were very low on Java heap space for a while. This resulted
in them flushing an SSTable to disk for almost every write, such that some
column families ended up with 1000+ SSTables, most of which contain between
1 and 10 rows each.
The memory problem is solved now, and the cluster serves reads & writes
fine, but it doesn't seem to be possible to compact this huge number of
SSTables. If we try to run a major compaction, Cassandra dies with an
OutOfMemoryError, probably because each open SSTable carries some memory
overhead. Increasing the heap by 1 GB didn't help either.

Would it be possible to trigger a manual partial compaction, to first
compact 4x 256 tables? Could this be added to nodetool if it doesn't exist
already?
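(One possible workaround I haven't verified: the CompactionManager MBean
exposes a forceUserDefinedCompaction operation that can be pointed at an
explicit subset of SSTables through a JMX client such as jmxterm. The
operation exists, but its exact argument list differs between Cassandra
versions, so the line below is only a sketch with placeholder keyspace,
file and jar names.)

echo "run -b org.apache.cassandra.db:type=CompactionManager \
  forceUserDefinedCompaction mykeyspace mycf-Data-1.db,mycf-Data-2.db" | \
  java -jar jmxterm-uber.jar -l localhost:7199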

Best regards,

Mathijs Vogelzang


Need Help with Cassandra Tombstone

2014-12-15 Thread Chamila Wijayarathna
Hello all,

I have a column family in which I have to update a field, frequency, but it
is a clustering key. So I am deleting the existing row and adding a new row
with the updated frequency.

I want to free the space used for deleted rows as soon as possible, so I
decided to change the gc_grace_seconds value to something smaller than the
default. For that I used the following query in cqlsh.

*alter table corpus.word_inv_pos_frequency with GC_GRACE_SECONDS = 3600;*

Is this enough to free space after one hour? Do I have to do anything else?

Also, how can I check the gc_grace_seconds value of a column family using
cqlsh or cassandra-cli?
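(For reference, this is what I found so far for checking it on 2.x, though I
am not sure it is the right way:)

DESCRIBE TABLE corpus.word_inv_pos_frequency;
-- lists gc_grace_seconds among the table options

SELECT gc_grace_seconds FROM system.schema_columnfamilies
 WHERE keyspace_name = 'corpus'
   AND columnfamily_name = 'word_inv_pos_frequency';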

Thank You!

-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Re: Need Help with Cassandra Tombstone

2014-12-15 Thread DuyHai Doan
Hello Chamila

If you're repeatedly deleting and re-inserting a clustering column value, it
looks like the queue anti-pattern, which should be avoided:

http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets



On Mon, Dec 15, 2014 at 10:06 AM, Chamila Wijayarathna 
cdwijayarat...@gmail.com wrote:




Number of SSTables grows after repair

2014-12-15 Thread Michał Łowicki
Hi,

We've noticed that the number of SSTables grows radically after running
*repair*. What we did today was to compact everything, so on each node the
number of SSTables was < 10. After repair it jumped to ~1600 on each node.
What is interesting is that many of them are very small; the smallest ones
are ~60 bytes in size (http://paste.ofcode.org/6yyH2X52emPNrKdw3WXW3d)

Table information - http://paste.ofcode.org/32RijfxQkNeb9cx9GAAnM45
We're using Cassandra 2.1.2.

-- 
BR,
Michał Łowicki


Re: Cassandra Database using too much space

2014-12-15 Thread Jack Krupansky
I also meant to point out that you have to be careful with very wide 
partitions, like those where the partition key is the year, with all usages for 
that year. Thousands of rows in a partition is probably okay, but millions 
could become problematic. 100MB for a single partition is a reasonable limit – 
beyond that you need to start using “buckets” to break up ultra-large 
partitions.
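For example (a sketch only, with made-up table and column names), instead of
partitioning by year alone you could fold a bucket column into the partition
key:

CREATE TABLE word_year_usage_bucketed (
    year int,
    bucket int,       -- e.g. month 1-12, or hash(word) % 16
    word text,
    position int,
    usage_info text,
    PRIMARY KEY ((year, bucket), word, position)
);

Reads then either derive the bucket from the query parameters or fan out over
all buckets for the year.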

Also, you need to look carefully at how you want to query each table.

-- Jack Krupansky

From: Chamila Wijayarathna 
Sent: Sunday, December 14, 2014 11:36 PM
To: user@cassandra.apache.org 
Subject: Re: Cassandra Database using too much space

Hi Jack , 

Thanks for replying.

Here, what I meant by 1.5M words is not 1.5M distinct words; it is the count of 
all words we added to the corpus (total word instances). Then in the 
word_frequency and word_ordered_frequency CFs, we have a row for each distinct 
word with its frequency (the two CFs have the same data with different 
indexing). We also keep frequencies year-wise, category-wise (newspaper, 
magazine, fiction, etc.) and by the position where the word occurs in a 
sentence. So the distinct word count will probably be about 0.2M. We don't keep 
any rows in the frequency tables where the frequency is 0, so the word 'abc' 
may only have rows for the years 2014 and 2010 if it is only used in those years.

In the bigram and trigram tables, we do not store all possible combinations of 
words; we only store bigrams/trigrams that occur in the resources we have 
considered. In the word_usage table we have an entry for each word instance, 
which means 1.5M rows with the context details of where the word has been used. 
The same happens for bigrams and trigrams as well.

Here we used separate column families word_usage, word_year_usage, 
word_Category_usage with the same details, since we have to search in 4 
scenarios, using 

  1. year, 
  2. category, 
  3. year & category, 
  4. none

inside the WHERE clause and also order them by date. They contain the same data 
but different indexing. The same goes for the bigram and trigram CFs.

We update frequencies while entering words into the database. So for every word 
instance we add, we either insert a new row or update an existing row. In some 
cases, where we use frequency as a clustering key, we can't update the 
frequency, so we delete the entire row and add a new row with the updated 
frequency. [1] is the client we used for inserting data.
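(To illustrate, with column names that are only an approximation of our real
schema, one such update looks roughly like this:)

BEGIN BATCH
  DELETE FROM corpus.word_ordered_frequency
   WHERE year = 2014 AND frequency = 41 AND word = 'abc';
  INSERT INTO corpus.word_ordered_frequency (year, frequency, word)
  VALUES (2014, 42, 'abc');
APPLY BATCH;

Each such delete leaves a tombstone behind.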

I am very new to Cassandra and I may have done a lot of bad things in modeling 
and implementing this database. Please let me know if there is anything wrong 
here.

Thank You!

1. 
https://github.com/DImuthuUpe/DBFeederMvn/blob/master/src/main/java/com/sinmin/corpus/cassandra/CassandraClient.java

On Mon, Dec 15, 2014 at 1:46 AM, Jack Krupansky j...@basetechnology.com 
wrote: 
  It looks like you will have quite a few “combinatoric explosions” to cope 
with. In addition to 1.5M words, you have bigrams and trigrams – combinations 
of two and three words. You need to get a handle on the cardinality of each of 
your tables. Bigrams and trigrams could give you who knows how many millions 
more rows than the 1.5M word frequency rows.

  And then you have word, bigram, and trigram frequencies by year as well, 
meaning take the counts from above and multiply by the number of years in your 
corpus!

  And then you have word, bigram, and trigram “usage” - and by year as well. 
Is that every unique sentence from the corpus? Either way, this is an 
incredible combinatoric explosion.

  And then there is category and position, which I didn’t look at since you 
didn’t specify what exactly they are. Once again, start with a focus on 
cardinality of the data.

  In short, just as a thought experiment, say that your 1.5M words expanded 
into 15M rows, divide that into 15Gbytes and that would give you 1000 bytes per 
row, which may be a bit more than desired, but not totally unreasonable. And 
maybe the explosion is more like 30 to 1, which would give like 333 bytes per 
row, which seems quite reasonable.

  Also, are you doing heavy updates, for each word (and bigram and trigram) as 
each occurrence is encountered in the corpus or are you counting things in 
memory and then only writing each row once after the full corpus has been read?

  Also, what is the corpus size – total word instances, both for the full 
corpus and for the subset containing your 1.5 million words?

  -- Jack Krupansky

  From: Chamila Wijayarathna 
  Sent: Sunday, December 14, 2014 7:01 AM
  To: user@cassandra.apache.org 
  Subject: Cassandra Database using too much space

  Hello all, 

  We are trying to develop a language corpus by using Cassandra as its storage 
medium.

  https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types 
of information we need to extract from corpus interface. 

  So we designed the schema at 
https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the 
database. Our target is to develop a corpus with 100+ million words.

  By now we have inserted about 1.5 million words and database has 

Snappy 1.1.0 / Cassandra 2.1.2 compatibility

2014-12-15 Thread Fredrik Larsson Stigbäck
Is it safe to replace Snappy 1.0.5 in a Cassandra 2.1.2 environment with Snappy 
1.1.0?
I’ve tried running with 1.1.0 and Cassandra seems to run with no issues, and 
according to this post https://github.com/xerial/snappy-java/issues/60 
1.1.0 is compatible with 1.0.5. But might there be problems/data incompatibility 
in the future when upgrading Cassandra to a newer version, regarding 
*CompressionInfo.db files etc.?

/Fredrik



Re: Cassandra Maintenance Best practices

2014-12-15 Thread Neha Trivedi
Thanks very much Jonathan !!

On Wed, Dec 10, 2014 at 1:00 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 I did a presentation on diagnosing performance problems in production at
 the US & Euro summits, in which I covered quite a few tools & preventative
 measures you should know when running a production cluster.  You may find
 it useful:
 http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/

 On ops center - I recommend it.  It gives you a nice dashboard.  I don't
 think it's completely comprehensive (but no tool really is) but it gets you
 90% of the way there.

 It's a good idea to run repairs, especially if you're doing deletes or
 querying at CL=ONE.  I assume you're not using quorum, because on RF=2
 that's the same as CL=ALL.
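 A minimal example (keyspace name is a placeholder) is a weekly cron entry on
 each node that runs a primary-range repair:

 nodetool repair -pr my_keyspace

 The -pr flag limits the repair to the node's primary ranges so the same data
 isn't repaired once per replica.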

 I recommend at least RF=3 because if you lose 1 server, you're on the edge
 of data loss.


 On Tue Dec 09 2014 at 7:19:32 PM Neha Trivedi nehajtriv...@gmail.com
 wrote:

 Hi,
 We have a two-node cluster configuration in production with RF=2.

 This means that the data is written to both nodes. It has been running
 for about a month now and has a good amount of data.

 Questions:
 1. What are the best practices for maintenance?
 2. Is OpsCenter required to be installed, or can I manage with the nodetool
 utility?
 3. Is it necessary to run repair weekly?

 thanks
 regards
 Neha




Re: Good partition key doubt

2014-12-15 Thread José Guilherme Vanz
Nice, I got it. =]
If I have more questions I'll send other emails. xD
Thank you

On Thu, Dec 11, 2014 at 12:17 PM, DuyHai Doan doanduy...@gmail.com wrote:

 what is a good partition key? Is the partition key directly related to my
 query performance? What are the best practices?

 A good partition key is a partition key that will scale with your data. An
 example: if you have a business involving individuals, it is likely that
 your business will scale as the number of users grows. In this
 case user_id is a good partition key because all the users will
 be uniformly distributed over all the Cassandra nodes.

 For your log example, using only server_id for the partition key is clearly
 not enough because what will scale is the number of log lines, not the number
 of servers.

 From the point of view of scalability (not talking about query-ability),
 adding the log_type will not scale either, because the number of different
 log types is likely to be a small set. For great scalability (not talking
 about query-ability), the couple (server_id, log_timestamp) is likely a good
 combination.

 Now for queries, as you should know, it is not possible to have a range query
 (using <, ≤, ≥, >) over the partition key; you must always use equality (=), so
 you won't be able to leverage the log_timestamp component in the partition
 key for your query.

 Bucketing by date is a good idea though, and the date resolution will
 depend on the log generation rate. If logs are generated very often, maybe
 a bucket by hour. If the generation rate is smaller, maybe a day or a week
 bucket is fine.

 Talking about log_type, putting it into the partition key will help
 partitioning further, in addition to the date bucket. However, it forces you
 to always provide a log_type whenever you want to query; be aware of this.

 An example of data model for your logs could be

 CREATE TABLE logs_by_server_and_type_and_date(
server_id int,
log_type text,
date_bucket int, //Date bucket using format MMDD or MMDDHH or
 ...
log_timestamp timeuuid,
log_info text,
PRIMARY KEY((server_id,log_type,date_bucket),log_timestamp)
 );


 And if I want to query all logs in a period of time, how can I select a
 range of rows? -- New query path = new table

 CREATE TABLE logs_by_date(
date_bucket int, //Date bucket using format MMDD or MMDDHH or
 ...
log_timestamp timeuuid,
server_id int,
log_type text,
log_info text,
PRIMARY KEY((date_bucket),log_timestamp) // you may add server_id or
 log_type as clustering column optionally
 );

 For this table, the date_bucket should be chosen very carefully because
 for the same bucket, we're going to store logs of ALL servers and all types
 ...

 For the query, you should provide the date bucket as the partition key, and
 then use (<, ≤, ≥, >) on the log_timestamp column
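 For example (assuming a day-granularity integer bucket like 20141215, this
 is just a sketch):

 SELECT * FROM logs_by_date
  WHERE date_bucket = 20141215
    AND log_timestamp >= minTimeuuid('2014-12-15 08:00+0000')
    AND log_timestamp < maxTimeuuid('2014-12-15 12:00+0000');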


 On Thu, Dec 11, 2014 at 12:00 PM, José Guilherme Vanz 
 guilherme@gmail.com wrote:

 Hello folks

 I have been studying Cassandra for a short period of time and now I am
 modeling a database for study purposes. During my modeling I have faced a
 doubt: what is a good partition key? Is the partition key directly related to
 my query performance? What are the best practices?

 Just as a study case, let's suppose I have a column family into which all
 kinds of logs (http server, application server, application logs, etc.) from
 different servers are inserted. In this column family I have a server_id
 (unique identifier for each server) column, a log_type (http server,
 application server, application log) column and a log_info column.
 Is it a good idea to create a partition key using the server_id and log_type
 columns, to store all log data of a specific type and server in one physical
 row? And what if I want a physical row for each day? Is it a good idea to add
 a third column with the date to the partition key? And if I want to query all
 logs in a period of time, how can I select a range of rows? Do I have to
 duplicate the date column (considering I have to use the = operator with the
 partition key)?

 All the best
 --
 Att. José Guilherme Vanz
 br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
 "Suffering is temporary, giving up is forever" - Bernardo Fonseca, record
 holder of the Antarctic Ice Marathon.




-- 
Att. José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
"Suffering is temporary, giving up is forever" - Bernardo Fonseca, record
holder of the Antarctic Ice Marathon.


Re: batch_size_warn_threshold_in_kb

2014-12-15 Thread Eric Stevens
 Unfortunately my Scala isn't the best so I'm going to have to take a
little bit to wade through the code.

I think the important thing to take from this code is that:

1) execution order is randomized for each run, and new data is randomly
generated for each run to eliminate biases.
2) we write to five different key layouts in an attempt to eliminate bias
from some poorly chosen scheme, we test both clustering and non-clustering
approaches
3) We can fork *just* on batch-vs-single strategy (see
https://gist.github.com/MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L167-L180
) thanks to the DS driver having a common executable ancestor between them
(an extremely nice feature)
4) We test three different parallelism strategies to eliminate bias from a
poorly chosen concurrency model (see
https://gist.github.com/MightyE/1c98912fca104f6138fc/a7db68e72f99ac1215fcfb096d69391ee285c080#file-testsuite-L181-L203
)
5) The code path is identical wherever possible between strategies.
6) Principally this just sets up an Iterable of Statement (sometimes
members are batches, sometimes members are single statements), and times
how long they take to execute and complete with different concurrency
models.

*RE: Cassandra-Stress*
 It may be useful to run cassandra-stress (it doesn't seem to have a mode
for batches) to get a baseline on non-batches.  I'm curious to know if you
get different numbers than the scala profiler.

We always use SSL for everything, and I've struggled to get
cassandra-stress to talk to our SSL cluster.  Just so I don't keep spinning
my wheels on a temporary effort, I used CCM to stand up a 2.0.11 cluster
locally, and ran both tools against it.  I'm dubious about what you can
infer from such a test because it's not apples to apples (they write
different data).

Nevertheless, here is the output of ccm stress against my local machine -
I inserted 113,825 records in 62 seconds, and used this data size to drive
my tool:

Created keyspaces. Sleeping 3s for propagation.
total   interval_op_rate  interval_key_rate  latency  95th   99.9th
 elapsed_time
11271   1127  1127   8.9  144.7  401.1   10
27998   1672  1672   9.5  140.5  399.4   20
42189   1419  1419   9.3  148.0  494.5   31
59335   1714  1714   9.3  147.0  493.2   41
84957   2562  2562   6.1  137.1  493.3   51
113825  2886  2886   5.1  131.5  493.3   62


After a ccm clear && ccm start, here's my tool against this same local
cluster (note that I'm actually writing a total of 5x the records because I
write the same data to each of 5 tables).  My little local cluster just about
brought down my machine under this test (especially the second one).

 Execution Results for 1 runs of 113825 records =
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) as single
statements
Total Run Time
traverse test2 ((aid, bckt), end) =
25,488,179,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
25,497,183,000
traverse test5 ((aid, bckt, end)) =
25,529,444,000
traverse test3 ((aid, bckt), end, proto) reverse order=
31,495,348,000
traverse test1 ((aid, bckt), proto, end) reverse order=
33,686,013,000

 Execution Results for 1 runs of 113825 records =
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
of 10
Total Run Time
traverse test3 ((aid, bckt), end, proto) reverse order=
11,030,788,000
traverse test1 ((aid, bckt), proto, end) reverse order=
13,345,962,000
traverse test2 ((aid, bckt), end) =
15,110,208,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
16,398,982,000
traverse test5 ((aid, bckt, end)) =
22,166,119,000

For giggles I added token-aware batching (grouping statements within a
single batch by meta.getReplicas(statement.getKeyspace,
statement.getRoutingKey).iterator().next - see
https://gist.github.com/MightyE/1c98912fca104f6138fc#file-testsuite-L176-L189
), here's that run; results are comparable to before, and easily inside one
sigma of non-token-aware batching, so not a statistically significant
difference.

 Execution Results for 1 runs of 113825 records =
1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches
of 10
Total Run Time
traverse test2 ((aid, bckt), end) =
11,429,008,000
traverse test1 ((aid, bckt), proto, end) reverse order=
12,593,034,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering =
13,111,244,000
traverse test3 ((aid, bckt), end, proto) reverse order=
25,163,064,000
traverse test5 ((aid, bckt, end)) =
30,233,744,000



On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad j...@jonhaddad.com 

Changing replication factor of Cassandra cluster

2014-12-15 Thread Pranay Agarwal
Hi All,


I have a 20-node Cassandra cluster with 500 GB of data and a replication
factor of 1. I increased the replication factor to 3 and ran nodetool repair
on each node, one by one, as the docs say. But it takes hours for one node to
finish repair. Is that normal or am I doing something wrong?

Also, I took a backup of the Cassandra data on each node. How do I restore
the graph in a new cluster of nodes using the backup? Do I have to have the
token ranges backed up as well?

-Pranay


Is it possible to flush memtable in one virtual center?

2014-12-15 Thread Benyi Wang
We have one ring and two virtual data centers in our Cassandra cluster: one
is for real-time and the other is for analytics. My questions are:

   1. Are there memtables in the Analytics data center? To my understanding,
   yes.
   2. Is it possible to flush memtables, if they exist, in the Analytics data
   center only?

I'm using Cassandra 1.0.7 for this cluster.

Thanks.


Re: batch_size_warn_threshold_in_kb

2014-12-15 Thread Jonathan Haddad
You are, of course, free to use batches in your application.  Keep in mind
however, that both my and Ryan's advice is coming from debugging issues in
production.  I don't know why your Scala script is performing better on
batches than async.  It could be:

1) network.  Are you running the test script on your laptop and connecting
to the cluster over a WAN?  If so, I would not be shocked if batch was faster,
since your latency is going to be crazy high.

2) is the system under any other load?  I'd love to see the results of the
tests while cassandra-stress was running.  This is a step closer to
production, where you have to worry about such things.

3) The logic for doing async queries may be incorrect.
a) Are you just throwing all the queries at once against the cluster?  If
so, I'd love to see what's happening with GC.  Typically in a real workload
you'd be
b) Are you keeping the servers busy?  If you're calling wait() on a group
of futures, you're now blocking requests from being submitted and limiting
the throughput.

4) you're still only using 3 servers.  The horror of using batches
increases linearly as you add servers.

5) What exactly are you summing in the end?  The total real time taken, or
an aggregation of the async query times?  If it's the async query times
that's going to be pretty misleading (and incorrect).  Again, my Scala is
terrible so I could be reading it wrong.

Sorry I don't have more time to debug the script.  Any of the above ideas
apply?

Jon

On Mon Dec 15 2014 at 1:11:43 PM Eric Stevens migh...@gmail.com wrote:


Re: Is it possible to flush memtable in one virtual center?

2014-12-15 Thread Hannu Kröger
Hi,

You have memtables on each machine. So

1) Yes
2) Yes. In any case you have to run nodetool flush on each node that you
want to flush; in this case, run flush on each node in your analytics DC.
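Something like this works (host names and keyspace are placeholders), run
from any machine that can reach the nodes over JMX:

for h in analytics-node1 analytics-node2 analytics-node3; do
  nodetool -h "$h" flush my_keyspace
done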

Hannu

2014-12-16 1:20 GMT+02:00 Benyi Wang bewang.t...@gmail.com:

 We have one ring and two virtual data centers in our Cassandra cluster?
 one is for Real-Time and the other is for analytics. My questions are:

1. Are there memtables in Analytics Data Center? To my understanding,
it is true.
2. Is it possible to flush memtables if exist in Analytics Data Center
only?

 I'm using Cassandra 1.0.7 for this cluster.

 Thanks.