Permanent ReadTimeout

2015-01-12 Thread Ja Sam
*Environment*


   - Cassandra 2.1.0
   - 5 nodes in one DC (DC_A), 4 nodes in second DC (DC_B)
   - 2500 writes per second; I write only to DC_A with local_quorum
   - minimal reads (usually none, sometimes a few)

*Problem*

After a few weeks of running I cannot read any data from my cluster,
because I get a ReadTimeoutException like the following:

ERROR [Thrift:15] 2015-01-07 14:16:21,124
CustomTThreadPoolServer.java:219 - Error occurred during processing of
message.
com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
out - received only 2 responses.

To be precise, this is not the only problem in my cluster. The second one was
described here: Cassandra GC takes 30 seconds and hangs node, and I will try
the fix from CASSANDRA-6541 as leshkin suggested.

*Diagnosis*

I tried some of the tools presented in
http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
by Jon Haddad and got some strange results.


I ran the same query in DC_A and DC_B with tracing enabled. The query is
simple:

   SELECT * FROM X.customer_events
     WHERE customer = '1234567' AND utc_day = 16447
       AND bucket IN (1,2,3,4,5,6,7,8,9,10);

where the table is defined as follows:

  CREATE TABLE drev_maelstrom.customer_events (
    customer   text,
    utc_day    int,
    bucket     int,
    event_time bigint,
    event_id   blob,
    event_type int,
    event      blob,
    PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id, event_type)
  ) [...]
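For reference, a sketch of how such a trace can be captured in cqlsh (keyspace
name X as above; the per-replica lines quoted further down come from the trace
output):

   TRACING ON;
   SELECT * FROM X.customer_events
     WHERE customer = '1234567' AND utc_day = 16447
       AND bucket IN (1,2,3,4,5,6,7,8,9,10);
   TRACING OFF;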

Results of the query:

1) In DC_B the query finished in less than 0.22 seconds. In DC_A it took more
than 2.5 seconds (~10 times longer). The complication is that bucket can range
from -128 to 256.

2) In DC_B it checked ~1000 SSTables with lines like:

   Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
2015-01-12 13:51:49.467001 | 192.168.71.198 |   4782

whereas in DC_A it is:

   Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
2015-01-12 14:01:39.520001 | 192.168.61.199 |  25527

3) The total number of records in both DCs was the same.


*Question*

The question is quite simple: how can I speed up DC_A? It is my primary
DC; DC_B is mostly for backup, and there are a lot of network partitions
between A and B.

Maybe I should check something more, but I just don't know what that would
be.


Re: Permanent ReadTimeout

2015-01-12 Thread Ja Sam
To address your remarks:

1) About the 30-second GCs: I know that over time my cluster developed this
problem. We added the "magic" flag, but the result will only be visible in ~2
weeks (as I showed in the screenshot on StackOverflow). If you have any idea
how to fix/diagnose this problem, I will be very grateful.

2) That is probably true, but I don't think I can change it. Our data
centers are in different places and the network between them is not perfect.
But as far as we have observed, network partitions happen rarely - at most
once a week, for about an hour.

3) We are trying to do regular (incremental) repairs, but they usually do
not finish. Even local repairs have trouble finishing.

4) I will check it as soon as possible and post it here. If you have any
suggestion about what else I should check, you are welcome :)




On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens  wrote:

> If you're getting 30 second GC's, this all by itself could and probably
> does explain the problem.
>
> If you're writing exclusively to A, and there are frequent partitions
> between A and B, then A is potentially working a lot harder than B, because
> it needs to keep track of hinted handoffs to replay to B whenever
> connectivity is restored.  It's also acting as coordinator for writes which
> need to end up in B eventually.  This in turn may be a significant
> contributing factor to your GC pressure in A.
>
> I'd also grow suspicious of the integrity of B as a reliable backup of A
> unless you're running repair on a regular basis.
>
> Also, if you have thousands of SSTables, then you're probably falling
> behind on compaction, check nodetool compactionstats - you should typically
> have < 5 outstanding tasks (preferably 0-1).  If you're not behind on
> compaction, your sstable_size_in_mb might be a bad value for your use case.
>
> On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam  wrote:
>
>> *Environment*
>>
>>
>>- Cassandra 2.1.0
>>- 5 nodes in one DC (DC_A), 4 nodes in second DC (DC_B)
>>- 2500 writes per seconds, I write only to DC_A with local_quorum
>>- minimal reads (usually none, sometimes few)
>>
>> *Problem*
>>
>> After a few weeks of running I cannot read any data from my cluster,
>> because I have ReadTimeoutException like following:
>>
>> ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219 - 
>> Error occurred during processing of message.
>> com.google.common.util.concurrent.UncheckedExecutionException: 
>> java.lang.RuntimeException: 
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
>> received only 2 responses.
>>
>> To be precise it is not only problem in my cluster, The second one was
>> described here: Cassandra GC takes 30 seconds and hangs node
>> <http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node>
>>  and
>> I will try to use fix from CASSANDRA-6541
>> <http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin
>> suggested
>>
>> *Diagnose *
>>
>> I tried to use some tools which were presented on
>> http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
>> by Jon Haddad and have some strange result.
>>
>>
>> I tried to run same query in DC_A and DC_B with tracing enabled. Query is
>> simple:
>>
>>SELECT * FROM X.customer_events WHERE customer='1234567' AND
>> utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
>>
>> Where table is defiied as following:
>>
>>   CREATE TABLE drev_maelstrom.customer_events (customer text,utc_day
>> int, bucket int, event_time bigint, event_id blob, event_type int, event
>> blob,
>>
>>   PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id,
>> event_type)[...]
>>
>> Results of the query:
>>
>> 1) In DC_B the query finished in less then a 0.22 of second . In DC_A
>> more then 2.5 (~10 times longer). -> the problem is that bucket can be in
>> range form -128 to 256
>>
>> 2) In DC_B it checked ~1000 SSTables with lines like:
>>
>>Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
>> 2015-01-12 13:51:49.467001 | 192.168.71.198 |   4782
>>
>> Where in DC_A it is:
>>
>>Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
>> 2015-01-12 14:01:39.520001 | 192.168.61.199 |  25527
>>
>> 3) Total records in both DC were same.
>>
>>
>> *Question*
>>
>> The question is quite simple: how can I speed up DC_A - it is my primary
>> DC, DC_B is mostly for backup, and there is a lot of network partitions
>> between A and B.
>>
>> Maybe I should check something more, but I just don't have an idea what
>> it should be.
>>
>>
>>
>


Re: Permanent ReadTimeout

2015-01-13 Thread Ja Sam
Ad 4) I definitely have a big problem, because pending tasks: 3094.

The question is what should I change/monitor? I can present my whole
solution design if it helps.

On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam  wrote:

> To precise your remarks:
>
> 1) About 30 sec GC. I know that after time my cluster had such problem, we
> added "magic" flag, but result will be in ~2 weeks (as I presented in
> screen on StackOverflow). If you have any idea how can fix/diagnose this
> problem, I will be very grateful.
>
> 2) It is probably true, but I don't think that I can change it. Our data
> centers are in different places and the network between them is not
> perfect. But as we observed network partition happened rare. Maximum is
> once a week for an hour.
>
> 3) We are trying to do a regular repairs (incremental), but usually they
> do not finish. Even local repairs have problems with finishing.
>
> 4) I will check it as soon as possible and post it here. If you have any
> suggestion what else should I check, you are welcome :)
>
>
>
>
> On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens  wrote:
>
>> If you're getting 30 second GC's, this all by itself could and probably
>> does explain the problem.
>>
>> If you're writing exclusively to A, and there are frequent partitions
>> between A and B, then A is potentially working a lot harder than B, because
>> it needs to keep track of hinted handoffs to replay to B whenever
>> connectivity is restored.  It's also acting as coordinator for writes which
>> need to end up in B eventually.  This in turn may be a significant
>> contributing factor to your GC pressure in A.
>>
>> I'd also grow suspicious of the integrity of B as a reliable backup of A
>> unless you're running repair on a regular basis.
>>
>> Also, if you have thousands of SSTables, then you're probably falling
>> behind on compaction, check nodetool compactionstats - you should typically
>> have < 5 outstanding tasks (preferably 0-1).  If you're not behind on
>> compaction, your sstable_size_in_mb might be a bad value for your use case.
>>
>> On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam  wrote:
>>
>>> *Environment*
>>>
>>>
>>>- Cassandra 2.1.0
>>>- 5 nodes in one DC (DC_A), 4 nodes in second DC (DC_B)
>>>- 2500 writes per seconds, I write only to DC_A with local_quorum
>>>- minimal reads (usually none, sometimes few)
>>>
>>> *Problem*
>>>
>>> After a few weeks of running I cannot read any data from my cluster,
>>> because I have ReadTimeoutException like following:
>>>
>>> ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219 
>>> - Error occurred during processing of message.
>>> com.google.common.util.concurrent.UncheckedExecutionException: 
>>> java.lang.RuntimeException: 
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
>>> received only 2 responses.
>>>
>>> To be precise it is not only problem in my cluster, The second one was
>>> described here: Cassandra GC takes 30 seconds and hangs node
>>> <http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node>
>>>  and
>>> I will try to use fix from CASSANDRA-6541
>>> <http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin
>>> suggested
>>>
>>> *Diagnose *
>>>
>>> I tried to use some tools which were presented on
>>> http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
>>> by Jon Haddad and have some strange result.
>>>
>>>
>>> I tried to run same query in DC_A and DC_B with tracing enabled. Query
>>> is simple:
>>>
>>>SELECT * FROM X.customer_events WHERE customer='1234567' AND
>>> utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
>>>
>>> Where table is defiied as following:
>>>
>>>   CREATE TABLE drev_maelstrom.customer_events (customer text,utc_day
>>> int, bucket int, event_time bigint, event_id blob, event_type int, event
>>> blob,
>>>
>>>   PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id,
>>> event_type)[...]
>>>
>>> Results of the query:
>>>
>>> 1) In DC_B the query finished in less then a 0.22 of second . In DC_A
>>> more then 2.5 (~10 times longer). -> the problem is that bucket can be in
>>> range form -128 to 256
>>>
>>> 2) In DC_B it checked ~1000 SSTables with lines like:
>>>
>>>Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
>>> 2015-01-12 13:51:49.467001 | 192.168.71.198 |   4782
>>>
>>> Where in DC_A it is:
>>>
>>>Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
>>> 2015-01-12 14:01:39.520001 | 192.168.61.199 |  25527
>>>
>>> 3) Total records in both DC were same.
>>>
>>>
>>> *Question*
>>>
>>> The question is quite simple: how can I speed up DC_A - it is my primary
>>> DC, DC_B is mostly for backup, and there is a lot of network partitions
>>> between A and B.
>>>
>>> Maybe I should check something more, but I just don't have an idea what
>>> it should be.
>>>
>>>
>>>
>>
>


Re: Permanent ReadTimeout

2015-01-13 Thread Ja Sam
Your response is full of information; after reading it I think I designed
something in my system wrong. I will try to present what hardware I have
and what I am trying to achieve.

*Hardware:*
I have 9 machines; each machine has 10 HDDs for data (not SSDs) and 64 GB of
RAM.

*Requirements*
The Cassandra storage is designed for audit data, so the only operation is
INSERT.
Each event has the following properties: customer, UUID, event type (there
are 4 types), date-time and some other properties. The event is stored as a
protobuf in a blob.
There are two types of customers generating events: customers with a small
daily volume (up to 100 events) and customers with lots of events daily (up
to 100 thousand). But from the customer id alone I don't know which type of
customer it is.

There are two types of queries which I need to run:
1) Select all events for a customer in a date range. The range is small - up
to a few days. It is the "audit" query.
2) Select all event UUIDs for one day - this is for the reconciliation
process, where we need to check that every event was stored in Cassandra.

*Key-spaces*
Each day I write into two keyspaces:
1) One for storing data for the audit query. The table definition was
presented in previous mails.
2) One for reconciliation only - it is a one-day keyspace. After
reconciliation I can safely delete it. (A sketch of this layout follows
below.)
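A hypothetical sketch of what one day's reconciliation keyspace and table
could look like (the names and exact columns are illustrative, not the real
schema):

   CREATE KEYSPACE IF NOT EXISTS recon_20150113
     WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC_A': 5, 'DC_B': 3};

   -- one row per stored event; reading all rows for the day drives reconciliation
   CREATE TABLE IF NOT EXISTS recon_20150113.event_ids (
     utc_day  int,
     bucket   int,
     event_id blob,
     PRIMARY KEY ((utc_day, bucket), event_id)
   );

With the whole day in one keyspace, reconciliation is a full scan of
event_ids, and the keyspace can be dropped afterwards, matching the "delete
after reconciliation" step above.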


*Data replication*
We have set the following replication settings:
   REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC_A' : 5, 'DC_B' : 3};
which means that every machine in DC_A has all of the data and each machine
in DC_B has 3/4 of the data.

*Disk usage*
When I checked disk usage, not all disks had the same utilization and used
space.

*Questions*
1) Is there a way to utilize the HDDs better?
2) Maybe I should write to multiple keyspaces to get better HDD utilization?
3) Are my replication settings correct? Or maybe they are too high?
4) I can easily reduce write operations just by removing the reconciliation
keyspace, but I will still have to find a way to run the query for getting
all UUIDs for one day.

I hope I presented enough information; if something is missing, just write.
Thanks again for the help.



On Tue, Jan 13, 2015 at 5:35 PM, Eric Stevens  wrote:

> If you have fallen far behind on compaction, this is a hard situation to
> recover from.  It means that you're writing data faster than your cluster
> can absorb it.  The right path forward depends on a lot of factors, but in
> general you either need more servers or bigger servers, or else you need to
> write less data.
>
> Safely adding servers is actually hard in this situation, lots of
> aggressive compaction produces a result where bootstrapping new nodes
> (growing your cluster) causes a lot of over-streaming, meaning data that is
> getting compacted may be streamed multiple times, in the old SSTable, and
> again in the new post-compaction SSTable, and maybe again in another
> post-compaction SSTable.  For a healthy cluster, it's a trivial amount of
> overstreaming.  For an unhealthy cluster like this, you might not actually
> ever complete streaming and be able to successfully join the cluster before
> your target server's disks are full.
>
> If you can afford the space and don't already have things set up this way,
> disable compression and switch to size tiered compaction (you'll need to
> keep at least 50% of your disk space free to be safe in size tiered).  Also
> nodetool setcompactionthroughput will let you open the flood gates to try
> to catch up on compaction quickly (at the cost of read and write
> performance into the cluster).
>
> If you still can't catch up on compaction, you have a very serious
> problem.  You need to either reduce your write volume, or grow your cluster
> unsafely (disable bootstrapping new nodes) to reduce write pressure on your
> existing nodes.  Either way you should get caught up on compaction before
> you can safely add new nodes again.
>
> If you grow unsafely, you are effectively electing to discard data.  Some
> of it may be recoverable with a nodetool repair after you're caught up on
> compaction, but you will almost certainly lose some records.
>
> On Tue, Jan 13, 2015 at 2:22 AM, Ja Sam  wrote:
>
>> Ad 4) For sure I got a big problem. Because pending tasks: 3094
>>
>> The question is what should I change/monitor? I can present my whole
>> solution design, if it helps
>>
>> On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam  wrote:
>>
>>> To precise your remarks:
>>>
>>> 1) About 30 sec GC. I know that after time my cluster had such problem,
>>> we added "magic" flag, but result will be in ~2 weeks (as I presented in
>>> screen on StackOverflow). If you have any idea how can fix/diagnose this
>>> problem, I will be very grateful.
>>>
>>> 2) It is pr

How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Is there a simple way (or even a complicated one) to speed up a SELECT *
FROM [table] query?
I need to get all rows from one table every day. I split the tables and
create one per day, but the query is still quite slow (200 million records).

I was thinking about running this query in parallel, but I don't know if it
is possible.


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Your answer looks very promising.

How do you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky  wrote:

> The fastest way I am aware of is to do the queries in parallel to
> multiple cassandra nodes and make sure that you only ask them for keys
> they are responsible for. Otherwise, the node needs to resend your query
> which is much slower and creates unnecessary objects (and thus GC
> pressure).
>
> You can manually take advantage of the token range information, if the
> driver does not get this into account for you. Then, you can play with
> concurrency and batch size of a single query against one node.
> Basically, what you/driver should do is to transform the query to series
> of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)".
>
> I will need to look up the actual code, but the idea should be clear :)
>
> Jirka H.
>
>
> On 02/11/2015 11:26 AM, Ja Sam wrote:
> > Is there a simple way (or even a complicated one) how can I speed up
> > SELECT * FROM [table] query?
> > I need to get all rows form one table every day. I split tables, and
> > create one for each day, but still query is quite slow (200 millions
> > of records)
> >
> > I was thinking about run this query in parallel, but I don't know if
> > it is possible
>
>
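For reference, a minimal CQL sketch of the token-range approach Jirka
describes, assuming the Murmur3Partitioner (tokens from -2^63 to 2^63-1) and
the hypothetical per-day table recon_20150113.event_ids sketched in an
earlier mail; the client issues one query per sub-range and runs them in
parallel:

   -- first slice of the full token range
   SELECT * FROM recon_20150113.event_ids
     WHERE token(utc_day, bucket) >= -9223372036854775808
       AND token(utc_day, bucket) <  -4611686018427387904;

   -- second slice
   SELECT * FROM recon_20150113.event_ids
     WHERE token(utc_day, bucket) >= -4611686018427387904
       AND token(utc_day, bucket) <  0;

   -- ... and so on, until the last slice ends at 9223372036854775807

Aligning the slice boundaries with the token ranges each node actually owns
(visible via nodetool ring, or via the driver's cluster metadata) is what
lets the client send each query to a replica that holds the data, avoiding
the extra coordinator hop Jirka mentions.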


Many pending compactions

2015-02-16 Thread Ja Sam
*Environment*
1) Currently Cassandra 2.1.3, upgraded from 2.1.0 (as suggested by Al
Tobey from DataStax)
2) not using vnodes
3) two data centres: 5 nodes in one DC (DC_A), 4 nodes in the second DC (DC_B)
4) each node is set up on a physical box with two 16-core HT Xeon
processors (E5-2660), 64GB RAM and 10x2TB 7.2K SAS disks (one for
commitlog, nine for Cassandra data file directories), 1Gbps network. No
RAID, only JBOD.
5) 3500 writes per second; I write only to DC_A with local_quorum, with
RF=5 in the local DC_A on our largest CFs.
6) acceptable write times (usually a few ms unless we encounter some
problem within the cluster)
7) minimal reads (usually none, sometimes a few)
8) iostat looks OK ->
http://serverfault.com/questions/666136/interpreting-disk-stats-using-sar
9) We use SizeTiered compaction. We converted to it from Leveled.


*Problems*
Nowadays we see two main problems:
1) In DC_A we have a really large number of pending compactions (400-700
depending on the node). In DC_B everything is fine (10 is the short-term
maximum; usually it is less than 3). The number of pending compactions does
not change in the long term.
2) In DC_A reads usually end with a timeout exception. In DC_B reading is
fast and works without problems.

*The question*
Is there a way to diagnose what is wrong with my servers? I understand that
DC_A is doing much more work than DC_B, but we tested a much bigger load on
a test machine for a few days and everything was fine.


Many pending compactions

2015-02-16 Thread Ja Sam
Of course I made a mistake: I am using 2.1.2. Anyway, a nightly build is
available from
http://cassci.datastax.com/job/cassandra-2.1/

I read about cold_reads_to_omit. It looks promising. Should I also set the
compaction throughput?
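For reference, a minimal sketch of how that flag can be set per table
(assuming the customer_events table from the earlier mails; cold_reads_to_omit
is a size-tiered compaction subproperty):

   -- setting this to 0.0 stops STCS from omitting cold (rarely read)
   -- SSTables from compaction
   ALTER TABLE drev_maelstrom.customer_events
     WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                        'cold_reads_to_omit': 0.0};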

p.s. I am really sad that I didn't read this before:
https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/



On Monday, February 16, 2015, Carlos Rolo > wrote:

> Hi 100% in agreement with Roland,
>
> 2.1.x series is a pain! I would never recommend the current 2.1.x series
> for production.
>
> Clocks is a pain, and check your connectivity! Also check tpstats to see
> if your threadpools are being overrun.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
> *
> Tel: 1649
> www.pythian.com
>
> On Mon, Feb 16, 2015 at 8:12 PM, Roland Etzenhammer <
> r.etzenham...@t-online.de> wrote:
>
>> Hi,
>>
>> 1) Actual Cassandra 2.1.3, it was upgraded from 2.1.0 (suggested by Al
>> Tobey from DataStax)
>> 7) minimal reads (usually none, sometimes few)
>>
>> those two points keep me repeating an anwser I got. First where did you
>> get 2.1.3 from? Maybe I missed it, I will have a look. But if it is 2.1.2
>> whis is the latest released version, that version has many bugs - most of
>> them I got kicked by while testing 2.1.2. I got many problems with
>> compactions not beeing triggred on column families not beeing read,
>> compactions and repairs not beeing completed.  See
>>
>> https://www.mail-archive.com/search?l=user@cassandra.
>> apache.org&q=subject:%22Re%3A+Compaction+failing+to+trigger%
>> 22&o=newest&f=1
>> https://www.mail-archive.com/user%40cassandra.apache.org/msg40768.html
>>
>> Apart from that, how are those both datacenters connected? Maybe there is
>> a bottleneck.
>>
>> Also do you have ntp up and running on all nodes to keep all clocks in
>> thight sync?
>>
>> Note: I'm no expert (yet) - just sharing my 2 cents.
>>
>> Cheers,
>> Roland
>>
>
>
> --
>
>
>
>


Re: Many pending compactions

2015-02-16 Thread Ja Sam
One thing I do not understand: in my case compaction is running
permanently. Is there a way to check which compactions are pending? The only
information available is the total count.

On Monday, February 16, 2015, Ja Sam  wrote:

> Of couse I made a mistake. I am using 2.1.2. Anyway night build is
> available from
> http://cassci.datastax.com/job/cassandra-2.1/
>
> I read about cold_reads_to_omit It looks promising. Should I set also
> compaction throughput?
>
> p.s. I am really sad that I didn't read this before:
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
>
>
> On Monday, February 16, 2015, Carlos Rolo  wrote:
>
>> Hi 100% in agreement with Roland,
>>
>> 2.1.x series is a pain! I would never recommend the current 2.1.x series
>> for production.
>>
>> Clocks is a pain, and check your connectivity! Also check tpstats to see
>> if your threadpools are being overrun.
>>
>> Regards,
>>
>> Carlos Juzarte Rolo
>> Cassandra Consultant
>>
>> Pythian - Love your data
>>
>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>> <http://linkedin.com/in/carlosjuzarterolo>*
>> Tel: 1649
>> www.pythian.com
>>
>> On Mon, Feb 16, 2015 at 8:12 PM, Roland Etzenhammer <
>> r.etzenham...@t-online.de> wrote:
>>
>>> Hi,
>>>
>>> 1) Actual Cassandra 2.1.3, it was upgraded from 2.1.0 (suggested by Al
>>> Tobey from DataStax)
>>> 7) minimal reads (usually none, sometimes few)
>>>
>>> those two points keep me repeating an anwser I got. First where did you
>>> get 2.1.3 from? Maybe I missed it, I will have a look. But if it is 2.1.2
>>> whis is the latest released version, that version has many bugs - most of
>>> them I got kicked by while testing 2.1.2. I got many problems with
>>> compactions not beeing triggred on column families not beeing read,
>>> compactions and repairs not beeing completed.  See
>>>
>>> https://www.mail-archive.com/search?l=user@cassandra.
>>> apache.org&q=subject:%22Re%3A+Compaction+failing+to+trigger%
>>> 22&o=newest&f=1
>>> https://www.mail-archive.com/user%40cassandra.apache.org/msg40768.html
>>>
>>> Apart from that, how are those both datacenters connected? Maybe there
>>> is a bottleneck.
>>>
>>> Also do you have ntp up and running on all nodes to keep all clocks in
>>> thight sync?
>>>
>>> Note: I'm no expert (yet) - just sharing my 2 cents.
>>>
>>> Cheers,
>>> Roland
>>>
>>
>>
>> --
>>
>>
>>
>>


Re: Many pending compactions

2015-02-17 Thread Ja Sam
I set setcompactionthroughput to 999 permanently and it doesn't change
anything. IO is still the same. The CPU is idle.

On Tue, Feb 17, 2015 at 1:15 AM, Roni Balthazar 
wrote:

> Hi,
>
> You can run "nodetool compactionstats" to view statistics on compactions.
> Setting cold_reads_to_omit to 0.0 can help to reduce the number of
> SSTables when you use Size-Tiered compaction.
> You can also create a cron job to increase the value of
> setcompactionthroughput during the night or when your IO is not busy.
>
> From http://wiki.apache.org/cassandra/NodeTool:
> 0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
> 0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
>
> Cheers,
>
> Roni Balthazar
>
> On Mon, Feb 16, 2015 at 7:47 PM, Ja Sam  wrote:
> > One think I do not understand. In my case compaction is running
> permanently.
> > Is there a way to check which compaction is pending? The only
> information is
> > about total count.
> >
> >
> > On Monday, February 16, 2015, Ja Sam  wrote:
> >>
> >> Of couse I made a mistake. I am using 2.1.2. Anyway night build is
> >> available from
> >> http://cassci.datastax.com/job/cassandra-2.1/
> >>
> >> I read about cold_reads_to_omit It looks promising. Should I set also
> >> compaction throughput?
> >>
> >> p.s. I am really sad that I didn't read this before:
> >>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
> >>
> >>
> >>
> >> On Monday, February 16, 2015, Carlos Rolo  wrote:
> >>>
> >>> Hi 100% in agreement with Roland,
> >>>
> >>> 2.1.x series is a pain! I would never recommend the current 2.1.x
> series
> >>> for production.
> >>>
> >>> Clocks is a pain, and check your connectivity! Also check tpstats to
> see
> >>> if your threadpools are being overrun.
> >>>
> >>> Regards,
> >>>
> >>> Carlos Juzarte Rolo
> >>> Cassandra Consultant
> >>>
> >>> Pythian - Love your data
> >>>
> >>> rolo@pythian | Twitter: cjrolo | Linkedin:
> >>> linkedin.com/in/carlosjuzarterolo
> >>> Tel: 1649
> >>> www.pythian.com
> >>>
> >>> On Mon, Feb 16, 2015 at 8:12 PM, Roland Etzenhammer
> >>>  wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> 1) Actual Cassandra 2.1.3, it was upgraded from 2.1.0 (suggested by Al
> >>>> Tobey from DataStax)
> >>>> 7) minimal reads (usually none, sometimes few)
> >>>>
> >>>> those two points keep me repeating an anwser I got. First where did
> you
> >>>> get 2.1.3 from? Maybe I missed it, I will have a look. But if it is
> 2.1.2
> >>>> whis is the latest released version, that version has many bugs -
> most of
> >>>> them I got kicked by while testing 2.1.2. I got many problems with
> >>>> compactions not beeing triggred on column families not beeing read,
> >>>> compactions and repairs not beeing completed.  See
> >>>>
> >>>>
> >>>>
> https://www.mail-archive.com/search?l=user@cassandra.apache.org&q=subject:%22Re%3A+Compaction+failing+to+trigger%22&o=newest&f=1
> >>>>
> https://www.mail-archive.com/user%40cassandra.apache.org/msg40768.html
> >>>>
> >>>> Apart from that, how are those both datacenters connected? Maybe there
> >>>> is a bottleneck.
> >>>>
> >>>> Also do you have ntp up and running on all nodes to keep all clocks in
> >>>> thight sync?
> >>>>
> >>>> Note: I'm no expert (yet) - just sharing my 2 cents.
> >>>>
> >>>> Cheers,
> >>>> Roland
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>>
> >
>


Re: Many pending compactions

2015-02-17 Thread Ja Sam
After some diagnostics (we didn't set cold_reads_to_omit yet): compactions
are running, but VERY slowly, with "idle" IO.

We have a lot of data files in Cassandra. In DC_A it is about ~12
(only xxx-Data.db); DC_B has only ~4000.

I don't know if this changes anything, but:
1) in DC_A the avg size of a Data.db file is ~13 MB. I have a few really big
ones, but most are really small (almost 1 files are less than 100 MB).
2) in DC_B the avg size of a Data.db file is much bigger, ~260 MB.

Do you think that the above flag will help us?


On Tue, Feb 17, 2015 at 9:04 AM, Ja Sam  wrote:

> I set setcompactionthroughput 999 permanently and it doesn't change
> anything. IO is still same. CPU is idle.
>
> On Tue, Feb 17, 2015 at 1:15 AM, Roni Balthazar 
> wrote:
>
>> Hi,
>>
>> You can run "nodetool compactionstats" to view statistics on compactions.
>> Setting cold_reads_to_omit to 0.0 can help to reduce the number of
>> SSTables when you use Size-Tiered compaction.
>> You can also create a cron job to increase the value of
>> setcompactionthroughput during the night or when your IO is not busy.
>>
>> From http://wiki.apache.org/cassandra/NodeTool:
>> 0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
>> 0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
>>
>> Cheers,
>>
>> Roni Balthazar
>>
>> On Mon, Feb 16, 2015 at 7:47 PM, Ja Sam  wrote:
>> > One think I do not understand. In my case compaction is running
>> permanently.
>> > Is there a way to check which compaction is pending? The only
>> information is
>> > about total count.
>> >
>> >
>> > On Monday, February 16, 2015, Ja Sam  wrote:
>> >>
>> >> Of couse I made a mistake. I am using 2.1.2. Anyway night build is
>> >> available from
>> >> http://cassci.datastax.com/job/cassandra-2.1/
>> >>
>> >> I read about cold_reads_to_omit It looks promising. Should I set also
>> >> compaction throughput?
>> >>
>> >> p.s. I am really sad that I didn't read this before:
>> >>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>> >>
>> >>
>> >>
>> >> On Monday, February 16, 2015, Carlos Rolo  wrote:
>> >>>
>> >>> Hi 100% in agreement with Roland,
>> >>>
>> >>> 2.1.x series is a pain! I would never recommend the current 2.1.x
>> series
>> >>> for production.
>> >>>
>> >>> Clocks is a pain, and check your connectivity! Also check tpstats to
>> see
>> >>> if your threadpools are being overrun.
>> >>>
>> >>> Regards,
>> >>>
>> >>> Carlos Juzarte Rolo
>> >>> Cassandra Consultant
>> >>>
>> >>> Pythian - Love your data
>> >>>
>> >>> rolo@pythian | Twitter: cjrolo | Linkedin:
>> >>> linkedin.com/in/carlosjuzarterolo
>> >>> Tel: 1649
>> >>> www.pythian.com
>> >>>
>> >>> On Mon, Feb 16, 2015 at 8:12 PM, Roland Etzenhammer
>> >>>  wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> 1) Actual Cassandra 2.1.3, it was upgraded from 2.1.0 (suggested by
>> Al
>> >>>> Tobey from DataStax)
>> >>>> 7) minimal reads (usually none, sometimes few)
>> >>>>
>> >>>> those two points keep me repeating an anwser I got. First where did
>> you
>> >>>> get 2.1.3 from? Maybe I missed it, I will have a look. But if it is
>> 2.1.2
>> >>>> whis is the latest released version, that version has many bugs -
>> most of
>> >>>> them I got kicked by while testing 2.1.2. I got many problems with
>> >>>> compactions not beeing triggred on column families not beeing read,
>> >>>> compactions and repairs not beeing completed.  See
>> >>>>
>> >>>>
>> >>>>
>> https://www.mail-archive.com/search?l=user@cassandra.apache.org&q=subject:%22Re%3A+Compaction+failing+to+trigger%22&o=newest&f=1
>> >>>>
>> https://www.mail-archive.com/user%40cassandra.apache.org/msg40768.html
>> >>>>
>> >>>> Apart from that, how are those both datacenters connected? Maybe
>> there
>> >>>> is a bottleneck.
>> >>>>
>> >>>> Also do you have ntp up and running on all nodes to keep all clocks
>> in
>> >>>> thight sync?
>> >>>>
>> >>>> Note: I'm no expert (yet) - just sharing my 2 cents.
>> >>>>
>> >>>> Cheers,
>> >>>> Roland
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>>
>> >>>
>> >>>
>> >
>>
>
>


Re: Many pending compactions

2015-02-18 Thread Ja Sam
Hi,
Thanks for your "tip"; it looks like something changed - I still don't know
if it is OK.

My nodes started doing more compactions, but it looks like some compactions
are really slow.
IO is idle and CPU is quite OK (30%-40%). We set compactionthroughput to
999, but I do not see a difference.

Can we check something more? Or do you have any method to monitor progress
with the small files?

Regards

On Tue, Feb 17, 2015 at 2:43 PM, Roni Balthazar 
wrote:

> HI,
>
> Yes... I had the same issue and setting cold_reads_to_omit to 0.0 was
> the solution...
> The number of SSTables decreased from many thousands to a number below
> a hundred and the SSTables are now much bigger with several gigabytes
> (most of them).
>
> Cheers,
>
> Roni Balthazar
>
>
>
> On Tue, Feb 17, 2015 at 11:32 AM, Ja Sam  wrote:
> > After some diagnostic ( we didn't set yet cold_reads_to_omit ).
> Compaction
> > are running but VERY slow with "idle" IO.
> >
> > We had a lot of "Data files" in Cassandra. In DC_A it is about ~12
> (only
> > xxx-Data.db) in DC_B has only ~4000.
> >
> > I don't know if this change anything but:
> > 1) in DC_A avg size of Data.db file is ~13 mb. I have few a really big
> ones,
> > but most is really small (almost 1 files are less then 100mb).
> > 2) in DC_B avg size of Data.db is much bigger ~260mb.
> >
> > Do you think that above flag will help us?
> >
> >
> > On Tue, Feb 17, 2015 at 9:04 AM, Ja Sam  wrote:
> >>
> >> I set setcompactionthroughput 999 permanently and it doesn't change
> >> anything. IO is still same. CPU is idle.
> >>
> >> On Tue, Feb 17, 2015 at 1:15 AM, Roni Balthazar <
> ronibaltha...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> You can run "nodetool compactionstats" to view statistics on
> compactions.
> >>> Setting cold_reads_to_omit to 0.0 can help to reduce the number of
> >>> SSTables when you use Size-Tiered compaction.
> >>> You can also create a cron job to increase the value of
> >>> setcompactionthroughput during the night or when your IO is not busy.
> >>>
> >>> From http://wiki.apache.org/cassandra/NodeTool:
> >>> 0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
> >>> 0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
> >>>
> >>> Cheers,
> >>>
> >>> Roni Balthazar
> >>>
> >>> On Mon, Feb 16, 2015 at 7:47 PM, Ja Sam  wrote:
> >>> > One think I do not understand. In my case compaction is running
> >>> > permanently.
> >>> > Is there a way to check which compaction is pending? The only
> >>> > information is
> >>> > about total count.
> >>> >
> >>> >
> >>> > On Monday, February 16, 2015, Ja Sam  wrote:
> >>> >>
> >>> >> Of couse I made a mistake. I am using 2.1.2. Anyway night build is
> >>> >> available from
> >>> >> http://cassci.datastax.com/job/cassandra-2.1/
> >>> >>
> >>> >> I read about cold_reads_to_omit It looks promising. Should I set
> also
> >>> >> compaction throughput?
> >>> >>
> >>> >> p.s. I am really sad that I didn't read this before:
> >>> >>
> >>> >>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Monday, February 16, 2015, Carlos Rolo  wrote:
> >>> >>>
> >>> >>> Hi 100% in agreement with Roland,
> >>> >>>
> >>> >>> 2.1.x series is a pain! I would never recommend the current 2.1.x
> >>> >>> series
> >>> >>> for production.
> >>> >>>
> >>> >>> Clocks is a pain, and check your connectivity! Also check tpstats
> to
> >>> >>> see
> >>> >>> if your threadpools are being overrun.
> >>> >>>
> >>> >>> Regards,
> >>> >>>
> >>> >>> Carlos Juzarte Rolo
> >>> >>> Cassandra Consultant
> >>> >>>
> >>> >>> Pythian - Love your data
> >>> >>>
> >>> >>> rolo@pythian | Twitter: cjrolo | Linkedin:
> >

Re: Many pending compactions

2015-02-18 Thread Ja Sam
I don't have problems with DC_B (the replica); only in DC_A (my system
writes only to it) do I have read timeouts.

I checked the SSTable count in OpsCenter and I have:
1) in DC_A roughly the same +-10% for the last week, with a small increase
over the last 24h (it is more than 15000-2 SSTables depending on the node)
2) in DC_B the last 24h shows up to a 50% decrease, which gives a nice
prognosis. Now I have fewer than 1000 SSTables.

What did you measure during system optimization? Or do you have an idea
what more I should check?
1) I looked at CPU idle (one node is 50% idle, the rest 70% idle)
2) Disk queue -> mostly it is near zero: avg 0.09. Sometimes there are
spikes
3) system RAM usage is almost full
4) In Total Bytes Compacted most lines are below 3MB/s. For DC_A in total it
is less than 10MB/s; in DC_B it looks much better (avg is around 17MB/s)

Something else?



On Wed, Feb 18, 2015 at 1:32 PM, Roni Balthazar 
wrote:

> Hi,
>
> You can check if the number of SSTables is decreasing. Look for the
> "SSTable count" information of your tables using "nodetool cfstats".
> The compaction history can be viewed using "nodetool
> compactionhistory".
>
> About the timeouts, check this out:
> http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
> Also try to run "nodetool tpstats" to see the threads statistics. It
> can lead you to know if you are having performance problems. If you
> are having too many pending tasks or dropped messages, maybe will you
> need to tune your system (eg: driver's timeout, concurrent reads and
> so on)
>
> Regards,
>
> Roni Balthazar
>
> On Wed, Feb 18, 2015 at 9:51 AM, Ja Sam  wrote:
> > Hi,
> > Thanks for your "tip" it looks that something changed - I still don't
> know
> > if it is ok.
> >
> > My nodes started to do more compaction, but it looks that some
> compactions
> > are really slow.
> > In IO we have idle, CPU is quite ok (30%-40%). We set compactionthrouput
> to
> > 999, but I do not see difference.
> >
> > Can we check something more? Or do you have any method to monitor
> progress
> > with small files?
> >
> > Regards
> >
> > On Tue, Feb 17, 2015 at 2:43 PM, Roni Balthazar  >
> > wrote:
> >>
> >> HI,
> >>
> >> Yes... I had the same issue and setting cold_reads_to_omit to 0.0 was
> >> the solution...
> >> The number of SSTables decreased from many thousands to a number below
> >> a hundred and the SSTables are now much bigger with several gigabytes
> >> (most of them).
> >>
> >> Cheers,
> >>
> >> Roni Balthazar
> >>
> >>
> >>
> >> On Tue, Feb 17, 2015 at 11:32 AM, Ja Sam  wrote:
> >> > After some diagnostic ( we didn't set yet cold_reads_to_omit ).
> >> > Compaction
> >> > are running but VERY slow with "idle" IO.
> >> >
> >> > We had a lot of "Data files" in Cassandra. In DC_A it is about ~12
> >> > (only
> >> > xxx-Data.db) in DC_B has only ~4000.
> >> >
> >> > I don't know if this change anything but:
> >> > 1) in DC_A avg size of Data.db file is ~13 mb. I have few a really big
> >> > ones,
> >> > but most is really small (almost 1 files are less then 100mb).
> >> > 2) in DC_B avg size of Data.db is much bigger ~260mb.
> >> >
> >> > Do you think that above flag will help us?
> >> >
> >> >
> >> > On Tue, Feb 17, 2015 at 9:04 AM, Ja Sam  wrote:
> >> >>
> >> >> I set setcompactionthroughput 999 permanently and it doesn't change
> >> >> anything. IO is still same. CPU is idle.
> >> >>
> >> >> On Tue, Feb 17, 2015 at 1:15 AM, Roni Balthazar
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> You can run "nodetool compactionstats" to view statistics on
> >> >>> compactions.
> >> >>> Setting cold_reads_to_omit to 0.0 can help to reduce the number of
> >> >>> SSTables when you use Size-Tiered compaction.
> >> >>> You can also create a cron job to increase the value of
> >> >>> setcompactionthroughput during the night or when your IO is not
> busy.
> >> >>>
> >> >>> From http://wiki.apache.org/cassandra/NodeTool:
> >> >>> 0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
> >> >>> 0 6 * * * root node

Re: Many pending compactions

2015-02-18 Thread Ja Sam
1) We tried to run repairs but they usually do not succeed. We also had
Leveled compaction before: last week we ALTERed the tables to STCS (see the
sketch after this list), because the guys from DataStax suggested that,
since we don't have SSDs, we should not use Leveled and should switch the
tables to STCS. After this change we did not run any repair. Anyway, I don't
think it will change anything in the SSTable count - if I am wrong, please
let me know.

2) I did this. My tables are 99% write-only. It is an audit system.

3) Yes, I am using the default values.

4) For both operations I am using LOCAL_QUORUM.
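The ALTER mentioned in point 1 would have been along these lines (a sketch
only, using the customer_events table from earlier mails; the exact options
used may have differed):

   -- switch the table from Leveled to Size-Tiered compaction
   ALTER TABLE drev_maelstrom.customer_events
     WITH compaction = {'class': 'SizeTieredCompactionStrategy'};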

I am almost sure that the READ timeouts happen because of too many SSTables.
Anyway, first I would like to fix the many pending compactions. I still
don't know how to speed them up.


On Wed, Feb 18, 2015 at 2:49 PM, Roni Balthazar 
wrote:

> Are you running repairs within gc_grace_seconds? (default is 10 days)
>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
>
> Double check if you set cold_reads_to_omit to 0.0 on tables with STCS
> that you do not read often.
>
> Are you using default values for the properties
> min_compaction_threshold(4) and max_compaction_threshold(32)?
>
> Which Consistency Level are you using for reading operations? Check if
> you are not reading from DC_B due to your Replication Factor and CL.
>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>
>
> Cheers,
>
> Roni Balthazar
>
> On Wed, Feb 18, 2015 at 11:07 AM, Ja Sam  wrote:
> > I don't have problems with DC_B (replica) only in DC_A(my system write
> only
> > to it) I have read timeouts.
> >
> > I checked in OpsCenter SSTable count  and I have:
> > 1) in DC_A  same +-10% for last week, a small increase for last 24h (it
> is
> > more than 15000-2 SSTables depends on node)
> > 2) in DC_B last 24h shows up to 50% decrease, which give nice
> prognostics.
> > Now I have less then 1000 SSTables
> >
> > What did you measure during system optimizations? Or do you have an idea
> > what more should I check?
> > 1) I look at CPU Idle (one node is 50% idle, rest 70% idle)
> > 2) Disk queue -> mostly is it near zero: avg 0.09. Sometimes there are
> > spikes
> > 3) system RAM usage is almost full
> > 4) In Total Bytes Compacted most most lines are below 3MB/s. For total
> DC_A
> > it is less than 10MB/s, in DC_B it looks much better (avg is like 17MB/s)
> >
> > something else?
> >
> >
> >
> > On Wed, Feb 18, 2015 at 1:32 PM, Roni Balthazar  >
> > wrote:
> >>
> >> Hi,
> >>
> >> You can check if the number of SSTables is decreasing. Look for the
> >> "SSTable count" information of your tables using "nodetool cfstats".
> >> The compaction history can be viewed using "nodetool
> >> compactionhistory".
> >>
> >> About the timeouts, check this out:
> >>
> http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
> >> Also try to run "nodetool tpstats" to see the threads statistics. It
> >> can lead you to know if you are having performance problems. If you
> >> are having too many pending tasks or dropped messages, maybe will you
> >> need to tune your system (eg: driver's timeout, concurrent reads and
> >> so on)
> >>
> >> Regards,
> >>
> >> Roni Balthazar
> >>
> >> On Wed, Feb 18, 2015 at 9:51 AM, Ja Sam  wrote:
> >> > Hi,
> >> > Thanks for your "tip" it looks that something changed - I still don't
> >> > know
> >> > if it is ok.
> >> >
> >> > My nodes started to do more compaction, but it looks that some
> >> > compactions
> >> > are really slow.
> >> > In IO we have idle, CPU is quite ok (30%-40%). We set
> compactionthrouput
> >> > to
> >> > 999, but I do not see difference.
> >> >
> >> > Can we check something more? Or do you have any method to monitor
> >> > progress
> >> > with small files?
> >> >
> >> > Regards
> >> >
> >> > On Tue, Feb 17, 2015 at 2:43 PM, Roni Balthazar
> >> > 
> >> > wrote:
> >> >>
> >> >> HI,
> >> >>
> >> >> Yes... I had the same issue and setting cold_reads_to_omit to 0.0 was
> >> >> the solution...
> >> >> The number of SSTables decreased from many thousands to a number
> below
> >> >> a hundred and the SSTable

Re: Many pending compactions

2015-02-18 Thread Ja Sam
Can you explain to me the correlation between growing SSTables and repair?
I was sure, until your mail, that repair is only there to make data
consistent between nodes.

Regards


On Wed, Feb 18, 2015 at 4:20 PM, Roni Balthazar 
wrote:

> Which error are you getting when running repairs?
> You need to run repair on your nodes within gc_grace_seconds (eg:
> weekly). They have data that are not read frequently. You can run
> "repair -pr" on all nodes. Since you do not have deletes, you will not
> have trouble with that. If you have deletes, it's better to increase
> gc_grace_seconds before the repair.
>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
> After repair, try to run a "nodetool cleanup".
>
> Check if the number of SSTables goes down after that... Pending
> compactions must decrease as well...
>
> Cheers,
>
> Roni Balthazar
>
>
>
>
> On Wed, Feb 18, 2015 at 12:39 PM, Ja Sam  wrote:
> > 1) we tried to run repairs but they usually does not succeed. But we had
> > Leveled compaction before. Last week we ALTER tables to STCS, because
> guys
> > from DataStax suggest us that we should not use Leveled and alter tables
> in
> > STCS, because we don't have SSD. After this change we did not run any
> > repair. Anyway I don't think it will change anything in SSTable count -
> if I
> > am wrong please give me an information
> >
> > 2) I did this. My tables are 99% write only. It is audit system
> >
> > 3) Yes I am using default values
> >
> > 4) In both operations I am using LOCAL_QUORUM.
> >
> > I am almost sure that READ timeout happens because of too much SSTables.
> > Anyway firstly I would like to fix to many pending compactions. I still
> > don't know how to speed up them.
> >
> >
> > On Wed, Feb 18, 2015 at 2:49 PM, Roni Balthazar  >
> > wrote:
> >>
> >> Are you running repairs within gc_grace_seconds? (default is 10 days)
> >>
> >>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
> >>
> >> Double check if you set cold_reads_to_omit to 0.0 on tables with STCS
> >> that you do not read often.
> >>
> >> Are you using default values for the properties
> >> min_compaction_threshold(4) and max_compaction_threshold(32)?
> >>
> >> Which Consistency Level are you using for reading operations? Check if
> >> you are not reading from DC_B due to your Replication Factor and CL.
> >>
> >>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
> >>
> >>
> >> Cheers,
> >>
> >> Roni Balthazar
> >>
> >> On Wed, Feb 18, 2015 at 11:07 AM, Ja Sam  wrote:
> >> > I don't have problems with DC_B (replica) only in DC_A(my system write
> >> > only
> >> > to it) I have read timeouts.
> >> >
> >> > I checked in OpsCenter SSTable count  and I have:
> >> > 1) in DC_A  same +-10% for last week, a small increase for last 24h
> (it
> >> > is
> >> > more than 15000-2 SSTables depends on node)
> >> > 2) in DC_B last 24h shows up to 50% decrease, which give nice
> >> > prognostics.
> >> > Now I have less then 1000 SSTables
> >> >
> >> > What did you measure during system optimizations? Or do you have an
> idea
> >> > what more should I check?
> >> > 1) I look at CPU Idle (one node is 50% idle, rest 70% idle)
> >> > 2) Disk queue -> mostly is it near zero: avg 0.09. Sometimes there are
> >> > spikes
> >> > 3) system RAM usage is almost full
> >> > 4) In Total Bytes Compacted most most lines are below 3MB/s. For total
> >> > DC_A
> >> > it is less than 10MB/s, in DC_B it looks much better (avg is like
> >> > 17MB/s)
> >> >
> >> > something else?
> >> >
> >> >
> >> >
> >> > On Wed, Feb 18, 2015 at 1:32 PM, Roni Balthazar
> >> > 
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> You can check if the number of SSTables is decreasing. Look for the
> >> >> "SSTable count" information of your tables using "nodetool cfstats".
> >> >> The compaction history can be viewed using "nodetool
> >> >> compactionhistory".
> >> >>
> >> >> About t

Re: Many pending compactions

2015-02-18 Thread Ja Sam
ad 3) I did this already yesterday (setcompactionthroughput as well), but
the SSTables are still increasing.

ad 1) What do you think I should use: -pr, or should I try incremental
repair?



On Wed, Feb 18, 2015 at 4:54 PM, Roni Balthazar 
wrote:

> You are right... Repair makes the data consistent between nodes.
>
> I understand that you have 2 issues going on.
>
> You need to run repair periodically without errors and need to decrease
> the numbers of compactions pending.
>
> So I suggest:
>
> 1) Run repair -pr on all nodes. If you upgrade to the new 2.1.3, you can
> use incremental repairs. There were some bugs on 2.1.2.
> 2) Run cleanup on all nodes
> 3) Since you have too many cold SSTables, set cold_reads_to_omit to 0.0,
> and increase setcompactionthroughput for some time and see if the number
> of SSTables is going down.
>
> Let us know what errors are you getting when running repairs.
>
> Regards,
>
> Roni Balthazar
>
>
> On Wed, Feb 18, 2015 at 1:31 PM, Ja Sam  wrote:
>
>> Can you explain me what is the correlation between growing SSTables and
>> repair?
>> I was sure, until your  mail, that repair is only to make data consistent
>> between nodes.
>>
>> Regards
>>
>>
>> On Wed, Feb 18, 2015 at 4:20 PM, Roni Balthazar 
>> wrote:
>>
>>> Which error are you getting when running repairs?
>>> You need to run repair on your nodes within gc_grace_seconds (eg:
>>> weekly). They have data that are not read frequently. You can run
>>> "repair -pr" on all nodes. Since you do not have deletes, you will not
>>> have trouble with that. If you have deletes, it's better to increase
>>> gc_grace_seconds before the repair.
>>>
>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
>>> After repair, try to run a "nodetool cleanup".
>>>
>>> Check if the number of SSTables goes down after that... Pending
>>> compactions must decrease as well...
>>>
>>> Cheers,
>>>
>>> Roni Balthazar
>>>
>>>
>>>
>>>
>>> On Wed, Feb 18, 2015 at 12:39 PM, Ja Sam  wrote:
>>> > 1) we tried to run repairs but they usually does not succeed. But we
>>> had
>>> > Leveled compaction before. Last week we ALTER tables to STCS, because
>>> guys
>>> > from DataStax suggest us that we should not use Leveled and alter
>>> tables in
>>> > STCS, because we don't have SSD. After this change we did not run any
>>> > repair. Anyway I don't think it will change anything in SSTable count
>>> - if I
>>> > am wrong please give me an information
>>> >
>>> > 2) I did this. My tables are 99% write only. It is audit system
>>> >
>>> > 3) Yes I am using default values
>>> >
>>> > 4) In both operations I am using LOCAL_QUORUM.
>>> >
>>> > I am almost sure that READ timeout happens because of too much
>>> SSTables.
>>> > Anyway firstly I would like to fix to many pending compactions. I still
>>> > don't know how to speed up them.
>>> >
>>> >
>>> > On Wed, Feb 18, 2015 at 2:49 PM, Roni Balthazar <
>>> ronibaltha...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Are you running repairs within gc_grace_seconds? (default is 10 days)
>>> >>
>>> >>
>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
>>> >>
>>> >> Double check if you set cold_reads_to_omit to 0.0 on tables with STCS
>>> >> that you do not read often.
>>> >>
>>> >> Are you using default values for the properties
>>> >> min_compaction_threshold(4) and max_compaction_threshold(32)?
>>> >>
>>> >> Which Consistency Level are you using for reading operations? Check if
>>> >> you are not reading from DC_B due to your Replication Factor and CL.
>>> >>
>>> >>
>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>>> >>
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Roni Balthazar
>>> >>
>>> >> On Wed, Feb 18, 2015 at 11:07 AM, Ja Sam  wrote:
>>> >> > I don't have problems with DC_B (replica) only in DC_A(my system
>>> write
>>> >> > only
>

Re: Many pending compactions

2015-02-18 Thread Ja Sam
As Al Tobey suggested, I upgraded my 2.1.0 to a snapshot version of 2.1.3. I
have now installed exactly this build:
https://cassci.datastax.com/job/cassandra-2.1/912/
I see many compactions completing, but some of them are really slow.
Maybe I should send some stats from OpsCenter or the servers? But it is
difficult for me to choose what is important.

Regards



On Wed, Feb 18, 2015 at 6:11 PM, Jake Luciani  wrote:

> Ja, Please upgrade to official 2.1.3 we've fixed many things related to
> compaction.  Are you seeing the compactions % complete progress at all?
>
> On Wed, Feb 18, 2015 at 11:58 AM, Roni Balthazar 
> wrote:
>
>> Try repair -pr on all nodes.
>>
>> If after that you still have issues, you can try to rebuild the SSTables
>> using nodetool upgradesstables or scrub.
>>
>> Regards,
>>
>> Roni Balthazar
>>
>> Em 18/02/2015, às 14:13, Ja Sam  escreveu:
>>
>> ad 3)  I did this already yesterday (setcompactionthrouput also). But
>> still SSTables are increasing.
>>
>> ad 1) What do you think I should use -pr or try to use incremental?
>>
>>
>>
>> On Wed, Feb 18, 2015 at 4:54 PM, Roni Balthazar 
>> wrote:
>>
>>> You are right... Repair makes the data consistent between nodes.
>>>
>>> I understand that you have 2 issues going on.
>>>
>>> You need to run repair periodically without errors and need to decrease
>>> the numbers of compactions pending.
>>>
>>> So I suggest:
>>>
>>> 1) Run repair -pr on all nodes. If you upgrade to the new 2.1.3, you can
>>> use incremental repairs. There were some bugs on 2.1.2.
>>> 2) Run cleanup on all nodes
>>> 3) Since you have too many cold SSTables, set cold_reads_to_omit to
>>> 0.0, and increase setcompactionthroughput for some time and see if the
>>> number of SSTables is going down.
>>>
>>> Let us know what errors are you getting when running repairs.
>>>
>>> Regards,
>>>
>>> Roni Balthazar
>>>
>>>
>>> On Wed, Feb 18, 2015 at 1:31 PM, Ja Sam  wrote:
>>>
>>>> Can you explain me what is the correlation between growing SSTables and
>>>> repair?
>>>> I was sure, until your  mail, that repair is only to make data
>>>> consistent between nodes.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Wed, Feb 18, 2015 at 4:20 PM, Roni Balthazar <
>>>> ronibaltha...@gmail.com> wrote:
>>>>
>>>>> Which error are you getting when running repairs?
>>>>> You need to run repair on your nodes within gc_grace_seconds (eg:
>>>>> weekly). They have data that are not read frequently. You can run
>>>>> "repair -pr" on all nodes. Since you do not have deletes, you will not
>>>>> have trouble with that. If you have deletes, it's better to increase
>>>>> gc_grace_seconds before the repair.
>>>>>
>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
>>>>> After repair, try to run a "nodetool cleanup".
>>>>>
>>>>> Check if the number of SSTables goes down after that... Pending
>>>>> compactions must decrease as well...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Roni Balthazar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 18, 2015 at 12:39 PM, Ja Sam  wrote:
>>>>> > 1) we tried to run repairs but they usually does not succeed. But we
>>>>> had
>>>>> > Leveled compaction before. Last week we ALTER tables to STCS,
>>>>> because guys
>>>>> > from DataStax suggest us that we should not use Leveled and alter
>>>>> tables in
>>>>> > STCS, because we don't have SSD. After this change we did not run any
>>>>> > repair. Anyway I don't think it will change anything in SSTable
>>>>> count - if I
>>>>> > am wrong please give me an information
>>>>> >
>>>>> > 2) I did this. My tables are 99% write only. It is audit system
>>>>> >
>>>>> > 3) Yes I am using default values
>>>>> >
>>>>> > 4) In both operations I am using LOCAL_QUORUM.
>>>>> >
>>>>> > I am almost sure that READ timeout happens because of too much
>>>>> SSTables

Re: Many pending compactions

2015-02-24 Thread Ja Sam
The repair result was the following (we ran it on Friday): Cannot proceed on
repair because a neighbor (/192.168.61.201) is dead: session failed

To be honest, though, the neighbor did not die. The repair seemed to trigger
a series of full GC events on the initiating node. The results from the
logs are:

[2015-02-20 16:47:54,884] Starting repair command #2, repairing 7 ranges
for keyspace prem_maelstrom_2 (parallelism=PARALLEL, full=false)
[2015-02-21 02:21:55,640] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:22:55,642] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:23:55,642] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:24:55,644] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 04:41:08,607] Repair session
d5d01dd0-b917-11e4-bc97-e9a66e5b2124 for range
(85070591730234615865843651857942052874,102084710076281535261119195933814292480]
failed with error org.apache.cassandra.exceptions.RepairException: [repair
#d5d01dd0-b917-11e4-bc97-e9a66e5b2124 on prem_maelstrom_2/customer_events,
(85070591730234615865843651857942052874,102084710076281535261119195933814292480]]
Sync failed between /192.168.71.196 and /192.168.61.199
[2015-02-21 04:41:08,608] Repair session
eb8d8d10-b967-11e4-bc97-e9a66e5b2124 for range
(68056473384187696470568107782069813248,85070591730234615865843651857942052874]
failed with error java.io.IOException: Endpoint /192.168.61.199 died
[2015-02-21 04:41:08,608] Repair session
c48aef00-b971-11e4-bc97-e9a66e5b2124 for range (0,10] failed with error
java.io.IOException: Cannot proceed on repair because a neighbor (/
192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,609] Repair session
c48d38f0-b971-11e4-bc97-e9a66e5b2124 for range
(42535295865117307932921825928971026442,68056473384187696470568107782069813248]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,609] Repair session
c48d38f1-b971-11e4-bc97-e9a66e5b2124 for range
(127605887595351923798765477786913079306,136112946768375392941136215564139626496]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,619] Repair session
c48d6000-b971-11e4-bc97-e9a66e5b2124 for range
(136112946768375392941136215564139626496,0] failed with error
java.io.IOException: Cannot proceed on repair because a neighbor (/
192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,620] Repair session
c48d6001-b971-11e4-bc97-e9a66e5b2124 for range
(102084710076281535261119195933814292480,127605887595351923798765477786913079306]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,620] Repair command #2 finished


We tried to run the repair one more time. After 24 hours we got some
streaming errors. Moreover, we had to stop it because we started to get
write timeouts on the client :(

We checked iostat while we had the write timeouts. An example from one node
in DC_A is here:
The file also contains tpstats from all nodes. Nodes starting with "z" are
in DC_B, the rest are in DC_A.
Cassandra's data and commit log are on the dm-XX disks.
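
For reference, roughly the commands used to collect those numbers (the
sampling interval is just what we happened to use):

   # extended per-device statistics, in MB/s, sampled every 5 seconds
   iostat -x -m 5
   # thread pool pending/blocked counts, run on every node
   nodetool tpstats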

I also read
http://jonathanhui.com/cassandra-performance-tuning-and-monitoring and I am
thinking about:
1) memtable configuration - do you have any suggestions?
2) running INSERTs in batch statements - I am not sure whether this reduces
IO; again, do you have experience with this? (rough sketch below)
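
To illustrate what I mean in 2) - an unlogged batch grouped by partition
key; the table name and values are made up:

   BEGIN UNLOGGED BATCH
     INSERT INTO some_ks.events (pk, ts, payload) VALUES ('c1', 1, 0x00);
     INSERT INTO some_ks.events (pk, ts, payload) VALUES ('c1', 2, 0x01);
   APPLY BATCH;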

Any tips will be helpful

Regards
Piotrek

On Thu, Feb 19, 2015 at 10:34 AM, Roland Etzenhammer <
r.etzenham...@t-online.de> wrote:

> Hi,
>
> 2.1.3 is now the official latest release - I checked this morning and got
> this good surprise. Now it's update time - thanks to all guys involved, if
> I meet anyone one beer from me :-)
>
> The changelist is rather long:
> https://git1-us-west.apache.org/repos/asf?p=cassandra.git;
> a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-2.1.3
>
> Hopefully that will solve many of those oddities and not invent to much
> new ones :-)
>
> Cheers,
> Roland
>
>
>


Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi,
I wrote some questions before about my problems with a C* cluster. My whole
environment is described here:
https://www.mail-archive.com/user@cassandra.apache.org/msg40982.html
To sum up, I have thousands of SSTables in one DC and far fewer in the
second. I write only to the first DC.

Anyway, after reading a lot of posts/mails/Google results I am starting to
think that the only cause of the above is disk problems.

My OpsCenter with some stats is following:
https://drive.google.com/file/d/0B4N_AbBPGGwLR21CZk9OV1kxVDA/view

My iostats are like this:
https://drive.google.com/file/d/0B4N_AbBPGGwLTTZEeG1SYkF0cXc/view
(dm-XX are C* drives. dm-11 is for commitlog)

Could you be so kind as to validate the above and tell me whether my disks
are a real problem or not? And could you give me a tip on what I should do
with this cluster? Maybe I have a misconfiguration?

Regards
Piotrek


Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
I read that I shouldn't run a release whose last version digit is less than
6. But I started with 2.1.0 and then upgraded to 2.1.3.

As far as I know, I cannot downgrade it.

On Wed, Feb 25, 2015 at 12:05 PM, Carlos Rolo  wrote:

> Your latency doesn't seem high enough to cause that problem. I suspect
> more of a problem with the Cassandra version (2.1.3) than with the hard
> drives. I didn't look deep into the information provided, but for your
> reference, the only time I had serious problems (leading to OOM and all
> sorts of weird behavior) my hard drives were near 70ms latency.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
> <http://linkedin.com/in/carlosjuzarterolo>*
> Tel: 1649
> www.pythian.com
>
> On Wed, Feb 25, 2015 at 11:19 AM, Ja Sam  wrote:
>
>> Hi,
>> I write some question before about my problems with C* cluster. All my
>> environment is described here:
>> https://www.mail-archive.com/user@cassandra.apache.org/msg40982.html
>> To sum up I have thousands SSTables in one DC and much much less in
>> second. I write only to first DC.
>>
>> Anyway after reading a lot of post/mails/google I start to think that the
>> only reason of above is disk problems.
>>
>> My OpsCenter with some stats is following:
>> https://drive.google.com/file/d/0B4N_AbBPGGwLR21CZk9OV1kxVDA/view
>>
>> My iostats are like this:
>> https://drive.google.com/file/d/0B4N_AbBPGGwLTTZEeG1SYkF0cXc/view
>> (dm-XX are C* drives. dm-11 is for commitlog)
>>
>> If You could be so kind and validate above and give me an answer is my
>> disk are real problems or not? And give me a tip what should I do with
>> above cluster? Maybe I have misconfiguration?
>>
>> Regards
>> Piotrek
>>
>
>
> --
>
>
>
>


Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
I do NOT have SSDs. I have normal HDDs grouped as JBOD.
My CFs use SizeTieredCompactionStrategy.
I am using LOCAL_QUORUM for reads and writes. To be precise, I have a lot of
writes and almost zero reads.
I changed "cold_reads_to_omit" to 0.0 as someone suggested, and I set the
compaction throughput to 999.

So if my disks are idle, my CPU is below 40% and I have some free RAM - why
is the SSTable count growing? How can I speed up compactions?
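
For completeness, those two changes were applied roughly like this (keyspace
and table shown only as an example):

   ALTER TABLE prem_maelstrom_2.customer_events
     WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                        'cold_reads_to_omit': 0.0};

   nodetool setcompactionthroughput 999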

On Wed, Feb 25, 2015 at 5:16 PM, Nate McCall  wrote:

>
>
>> If You could be so kind and validate above and give me an answer is my
>> disk are real problems or not? And give me a tip what should I do with
>> above cluster? Maybe I have misconfiguration?
>>
>>
>>
> You disks are effectively idle. What consistency level are you using for
> reads and writes?
>
> Actually, 'await' is sort of weirdly high for idle SSDs. Check your
> interrupt mappings (cat /proc/interrupts) and make sure the interrupts are
> not being stacked on a single CPU.
>
>
>


Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi Roni,

It is not balanced. As I wrote to you last week, I have problems only in the
DC we write to (on the screenshot it is named AGRAF:
https://drive.google.com/file/d/0B4N_AbBPGGwLR21CZk9OV1kxVDA/view). The
problem is on ALL nodes in this DC.
In the second DC (ZETO) only one node has more than 30 SSTables, and pending
compactions are decreasing to zero.

In AGRAF the minimum number of pending compactions is 2500 and the maximum
is 6000 (the average on the OpsCenter screenshot is less than 5000).


Regards
Piotrek.

p.s. I don't know why my mail client displays my name as Ja Sam instead of
Piotr Stapp, but this doesn't change anything :)


On Wed, Feb 25, 2015 at 5:45 PM, Roni Balthazar 
wrote:

> Hi Ja,
>
> How are the pending compactions distributed between the nodes?
> Run "nodetool compactionstats" on all of your nodes and check if the
> pendings tasks are balanced or they are concentrated in only few
> nodes.
> You also can check the if the SSTable count is balanced running
> "nodetool cfstats" on your nodes.
>
> Cheers,
>
> Roni Balthazar
>
>
>
> On 25 February 2015 at 13:29, Ja Sam  wrote:
> > I do NOT have SSD. I have normal HDD group by JBOD.
> > My CF have SizeTieredCompactionStrategy
> > I am using local quorum for reads and writes. To be precise I have a lot
> of
> > writes and almost 0 reads.
> > I changed "cold_reads_to_omit" to 0.0 as someone suggest me. I used set
> > compactionthrouput to 999.
> >
> > So if my disk are idle, my CPU is less then 40%, I have some free RAM -
> why
> > SSTables count is growing? How I can speed up compactions?
> >
> > On Wed, Feb 25, 2015 at 5:16 PM, Nate McCall 
> wrote:
> >>
> >>
> >>>
> >>> If You could be so kind and validate above and give me an answer is my
> >>> disk are real problems or not? And give me a tip what should I do with
> above
> >>> cluster? Maybe I have misconfiguration?
> >>>
> >>>
> >>
> >> You disks are effectively idle. What consistency level are you using for
> >> reads and writes?
> >>
> >> Actually, 'await' is sort of weirdly high for idle SSDs. Check your
> >> interrupt mappings (cat /proc/interrupts) and make sure the interrupts
> are
> >> not being stacked on a single CPU.
> >>
> >>
> >
>


Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi Roni,
They aren't exactly balanced, but as I wrote before they are in the range of
2500-6000.
If you need exact data I will check them tomorrow morning. But all nodes in
AGRAF have seen a small increase in pending compactions during the last
week, which is the "wrong direction".

I will check getcompactionthroughput in the morning, but my feeling is that
this parameter doesn't change anything.
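
For the morning check, I will probably run something along these lines
(hostnames are placeholders):

   for h in agraf-node1 agraf-node2 agraf-node3 agraf-node4 agraf-node5; do
     echo "=== $h ==="
     ssh "$h" "nodetool getcompactionthroughput; nodetool compactionstats | head -n 1"
   done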

Regards
Piotr




On Wed, Feb 25, 2015 at 7:34 PM, Roni Balthazar 
wrote:

> Hi Piotr,
>
> What about the nodes on AGRAF? Are the pending tasks balanced between
> this DC nodes as well?
> You can check the pending compactions on each node.
>
> Also try to run "nodetool getcompactionthroughput" on all nodes and
> check if the compaction throughput is set to 999.
>
> Cheers,
>
> Roni Balthazar
>
> On 25 February 2015 at 14:47, Ja Sam  wrote:
> > Hi Roni,
> >
> > It is not balanced. As I wrote you last week I have problems only in DC
> in
> > which we writes (on screen it is named as AGRAF:
> > https://drive.google.com/file/d/0B4N_AbBPGGwLR21CZk9OV1kxVDA/view). The
> > problem is on ALL nodes in this dc.
> > In second DC (ZETO) only one node have more than 30 SSTables and pending
> > compactions are decreasing to zero.
> >
> > In AGRAF the minimum pending compaction is 2500 , maximum is 6000 (avg on
> > screen from opscenter is less then 5000)
> >
> >
> > Regards
> > Piotrek.
> >
> > p.s. I don't know why my mail client display my name as Ja Sam instead of
> > Piotr Stapp, but this doesn't change anything :)
> >
> >
> > On Wed, Feb 25, 2015 at 5:45 PM, Roni Balthazar  >
> > wrote:
> >>
> >> Hi Ja,
> >>
> >> How are the pending compactions distributed between the nodes?
> >> Run "nodetool compactionstats" on all of your nodes and check if the
> >> pendings tasks are balanced or they are concentrated in only few
> >> nodes.
> >> You also can check the if the SSTable count is balanced running
> >> "nodetool cfstats" on your nodes.
> >>
> >> Cheers,
> >>
> >> Roni Balthazar
> >>
> >>
> >>
> >> On 25 February 2015 at 13:29, Ja Sam  wrote:
> >> > I do NOT have SSD. I have normal HDD group by JBOD.
> >> > My CF have SizeTieredCompactionStrategy
> >> > I am using local quorum for reads and writes. To be precise I have a
> lot
> >> > of
> >> > writes and almost 0 reads.
> >> > I changed "cold_reads_to_omit" to 0.0 as someone suggest me. I used
> set
> >> > compactionthrouput to 999.
> >> >
> >> > So if my disk are idle, my CPU is less then 40%, I have some free RAM
> -
> >> > why
> >> > SSTables count is growing? How I can speed up compactions?
> >> >
> >> > On Wed, Feb 25, 2015 at 5:16 PM, Nate McCall 
> >> > wrote:
> >> >>
> >> >>
> >> >>>
> >> >>> If You could be so kind and validate above and give me an answer is
> my
> >> >>> disk are real problems or not? And give me a tip what should I do
> with
> >> >>> above
> >> >>> cluster? Maybe I have misconfiguration?
> >> >>>
> >> >>>
> >> >>
> >> >> You disks are effectively idle. What consistency level are you using
> >> >> for
> >> >> reads and writes?
> >> >>
> >> >> Actually, 'await' is sort of weirdly high for idle SSDs. Check your
> >> >> interrupt mappings (cat /proc/interrupts) and make sure the
> interrupts
> >> >> are
> >> >> not being stacked on a single CPU.
> >> >>
> >> >>
> >> >
> >
> >
>


Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi Roni,
The repair result is the following (we ran it on Friday): Cannot proceed on
repair because a neighbor (/192.168.61.201) is dead: session failed

To be honest, though, the neighbor did not die. The repair seemed to trigger
a series of full GC events on the initiating node. The results from the logs
are:

[2015-02-20 16:47:54,884] Starting repair command #2, repairing 7 ranges
for keyspace prem_maelstrom_2 (parallelism=PARALLEL, full=false)
[2015-02-21 02:21:55,640] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:22:55,642] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:23:55,642] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 02:24:55,644] Lost notification. You should check server log
for repair status of keyspace prem_maelstrom_2
[2015-02-21 04:41:08,607] Repair session
d5d01dd0-b917-11e4-bc97-e9a66e5b2124 for range
(85070591730234615865843651857942052874,102084710076281535261119195933814292480]
failed with error org.apache.cassandra.exceptions.RepairException: [repair
#d5d01dd0-b917-11e4-bc97-e9a66e5b2124 on prem_maelstrom_2/customer_events,
(85070591730234615865843651857942052874,102084710076281535261119195933814292480]]
Sync failed between /192.168.71.196 and /192.168.61.199
[2015-02-21 04:41:08,608] Repair session
eb8d8d10-b967-11e4-bc97-e9a66e5b2124 for range
(68056473384187696470568107782069813248,85070591730234615865843651857942052874]
failed with error java.io.IOException: Endpoint /192.168.61.199 died
[2015-02-21 04:41:08,608] Repair session
c48aef00-b971-11e4-bc97-e9a66e5b2124 for range (0,10] failed with error
java.io.IOException: Cannot proceed on repair because a neighbor (/
192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,609] Repair session
c48d38f0-b971-11e4-bc97-e9a66e5b2124 for range
(42535295865117307932921825928971026442,68056473384187696470568107782069813248]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,609] Repair session
c48d38f1-b971-11e4-bc97-e9a66e5b2124 for range
(127605887595351923798765477786913079306,136112946768375392941136215564139626496]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,619] Repair session
c48d6000-b971-11e4-bc97-e9a66e5b2124 for range
(136112946768375392941136215564139626496,0] failed with error
java.io.IOException: Cannot proceed on repair because a neighbor (/
192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,620] Repair session
c48d6001-b971-11e4-bc97-e9a66e5b2124 for range
(102084710076281535261119195933814292480,127605887595351923798765477786913079306]
failed with error java.io.IOException: Cannot proceed on repair because a
neighbor (/192.168.61.201) is dead: session failed
[2015-02-21 04:41:08,620] Repair command #2 finished


We tried to run the repair one more time. After 24 hours we got some
streaming errors. Moreover, 2-3 hours later, we had to stop it because we
started to get write timeouts on the client and our system started dying.
The iostats from the "dying" period plus tpstats are available here:
https://drive.google.com/file/d/0B4N_AbBPGGwLc25nU0lnY3Z5NDA/view



On Wed, Feb 25, 2015 at 7:50 PM, Roni Balthazar 
wrote:

> Hi Piotr,
>
> Are your repairs finishing without errors?
>
> Regards,
>
> Roni Balthazar
>
> On 25 February 2015 at 15:43, Ja Sam  wrote:
> > Hi, Roni,
> > They aren't exactly balanced but as I wrote before they are in range from
> > 2500-6000.
> > If you need exactly data I will check them tomorrow morning. But all
> nodes
> > in AGRAF have small increase of pending compactions during last week,
> which
> > is "wrong direction"
> >
> > I will check in the morning get compaction throuput, but my feeling about
> > this parameter is that it doesn't change anything.
> >
> > Regards
> > Piotr
> >
> >
> >
> >
> > On Wed, Feb 25, 2015 at 7:34 PM, Roni Balthazar  >
> > wrote:
> >>
> >> Hi Piotr,
> >>
> >> What about the nodes on AGRAF? Are the pending tasks balanced between
> >> this DC nodes as well?
> >> You can check the pending compactions on each node.
> >>
> >> Also try to run "nodetool getcompactionthroughput" on all nodes and
> >> check if the compaction throughput is set to 999.
> >>
> >> Cheers,
> >>
> >> Roni Balthazar
> >>
> >> On 25 February 2015 at 14:47, Ja Sam  wrote:
> >> > Hi Roni,
> >> >
> >> > It is not balanced. As I wrote you last week I have problems only in
> DC
> >> >

Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi,
It is not obvious, because the data is replicated to the second data center.
We checked it "manually" for random records we put into Cassandra and found
all of them in the secondary DC.
We know about every single GC failure, but this doesn't change anything. The
only remedy for a GC failure is to restart the node. For a few days now we
have not had any GC errors. It looks to me like a memory leak.
We use Chef.

By MANUAL compaction do you mean running nodetool compact? What would it
change compared to the compactions that are already running permanently?
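
I.e. would it be something like this per table (keyspace and table just as
an example)?

   nodetool compact prem_maelstrom_2 customer_events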

Regards
Piotrek

On Wed, Feb 25, 2015 at 8:13 PM, daemeon reiydelle 
wrote:

> I think you may have a vicious circle of errors: because your data is not
> properly replicated to the neighbour, it is not replicating to the
> secondary data center (yeah, obvious). I would suspect the GC errors are
> (also obviously) the result of a backlog of compactions that take out the
> neighbour (assuming replication of 3, that means each "neighbour" is
> participating in compaction from at least one other node besides the
> primary you are looking at (and can of course be much more, depending on
> e.g. vnode count if used).
>
> What happens is that when a node fails due to a GC error (can't reclaim
> space), that causes a cascade of other errors, as you see. Might I suggest
> you have someone in devops with monitoring experience install a monitoring
> tool that will notify you of EVERY SINGLE java GC failure event? Your
> DevOps team may have a favorite log shipping/monitoring tool, could use
> e.g. Puppet
>
> I think you may have to go through a MANUAL, table by table compaction.
>
>
>
>
>
> *...*
>
>
>
>
>
>
> *“Life should not be a journey to the grave with the intention of arriving
> safely in a pretty and well preserved body, but rather to skid in broadside
> in a cloud of smoke, thoroughly used up, totally worn out, and loudly
> proclaiming “Wow! What a Ride!” - Hunter Thompson*
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Wed, Feb 25, 2015 at 11:01 AM, Ja Sam  wrote:
>
>> Hi Roni,
>> The repair results is following (we run it Friday): Cannot proceed on
>> repair because a neighbor (/192.168.61.201) is dead: session failed
>>
>> But to be honest the neighbor did not died. It seemed to trigger a
>> series of full GC events on the initiating node. The results form logs
>> are:
>>
>> [2015-02-20 16:47:54,884] Starting repair command #2, repairing 7 ranges
>> for keyspace prem_maelstrom_2 (parallelism=PARALLEL, full=false)
>> [2015-02-21 02:21:55,640] Lost notification. You should check server log
>> for repair status of keyspace prem_maelstrom_2
>> [2015-02-21 02:22:55,642] Lost notification. You should check server log
>> for repair status of keyspace prem_maelstrom_2
>> [2015-02-21 02:23:55,642] Lost notification. You should check server log
>> for repair status of keyspace prem_maelstrom_2
>> [2015-02-21 02:24:55,644] Lost notification. You should check server log
>> for repair status of keyspace prem_maelstrom_2
>> [2015-02-21 04:41:08,607] Repair session
>> d5d01dd0-b917-11e4-bc97-e9a66e5b2124 for range
>> (85070591730234615865843651857942052874,102084710076281535261119195933814292480]
>> failed with error org.apache.cassandra.exceptions.RepairException: [repair
>> #d5d01dd0-b917-11e4-bc97-e9a66e5b2124 on prem_maelstrom_2/customer_events,
>> (85070591730234615865843651857942052874,102084710076281535261119195933814292480]]
>> Sync failed between /192.168.71.196 and /192.168.61.199
>> [2015-02-21 04:41:08,608] Repair session
>> eb8d8d10-b967-11e4-bc97-e9a66e5b2124 for range
>> (68056473384187696470568107782069813248,85070591730234615865843651857942052874]
>> failed with error java.io.IOException: Endpoint /192.168.61.199 died
>> [2015-02-21 04:41:08,608] Repair session
>> c48aef00-b971-11e4-bc97-e9a66e5b2124 for range (0,10] failed with error
>> java.io.IOException: Cannot proceed on repair because a neighbor (/
>> 192.168.61.201) is dead: session failed
>> [2015-02-21 04:41:08,609] Repair session
>> c48d38f0-b971-11e4-bc97-e9a66e5b2124 for range
>> (42535295865117307932921825928971026442,68056473384187696470568107782069813248]
>> failed with error java.io.IOException: Cannot proceed on repair because a
>> neighbor (/192.168.61.201) is dead: session failed
>> [2015-02-21 04:41:08,609] Repair session
>> c48d38f1-b971-11e4-bc97-e9a66e5b2124 for range
>> (127605887595351923798765477786913079306,136112946768375392941136215564139626496]
>> failed with error java.io.IOException: Cannot proceed on r

Re: Possible problem with disk latency

2015-02-25 Thread Ja Sam
Hi,
One more thing: the Hinted Handoff count for the last week was less than 5
on all nodes.
For me every READ is a problem, because it must open too many files (3
SSTables), which shows up as errors in reads, repairs, etc.
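
(If it helps, the number of SSTables touched per read can be checked with
cfhistograms - keyspace and table again only as an example:

   nodetool cfhistograms prem_maelstrom_2 customer_events

the "SSTables" column shows the per-read percentiles.)
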
Regards
Piotrek

On Wed, Feb 25, 2015 at 8:32 PM, Ja Sam  wrote:

> Hi,
> It is not obvious, because data is replicated to second data center. We
> check it "manually" for random records we put into Cassandra and we find
> all of them in secondary DC.
> We know about every single GC failure, but this doesn't change anything.
> The problem with GC failure is only one: restart the node. For few days we
> do not have GC errors anymore. It looks for me like memory leaks.
> We use Chef.
>
> By MANUAL compaction you mean running nodetool compact?  What does it
> change to permanently running compactions?
>
> Regards
> Piotrek
>
> On Wed, Feb 25, 2015 at 8:13 PM, daemeon reiydelle 
> wrote:
>
>> I think you may have a vicious circle of errors: because your data is not
>> properly replicated to the neighbour, it is not replicating to the
>> secondary data center (yeah, obvious). I would suspect the GC errors are
>> (also obviously) the result of a backlog of compactions that take out the
>> neighbour (assuming replication of 3, that means each "neighbour" is
>> participating in compaction from at least one other node besides the
>> primary you are looking at (and can of course be much more, depending on
>> e.g. vnode count if used).
>>
>> What happens is that when a node fails due to a GC error (can't reclaim
>> space), that causes a cascade of other errors, as you see. Might I suggest
>> you have someone in devops with monitoring experience install a monitoring
>> tool that will notify you of EVERY SINGLE java GC failure event? Your
>> DevOps team may have a favorite log shipping/monitoring tool, could use
>> e.g. Puppet
>>
>> I think you may have to go through a MANUAL, table by table compaction.
>>
>>
>>
>>
>>
>> *...*
>>
>>
>>
>>
>>
>>
>> *“Life should not be a journey to the grave with the intention of
>> arriving safely in a pretty and well preserved body, but rather to skid in
>> broadside in a cloud of smoke, thoroughly used up, totally worn out, and
>> loudly proclaiming “Wow! What a Ride!” - Hunter Thompson*
>> Daemeon C.M. Reiydelle
>> USA (+1) 415.501.0198
>> London (+44) (0) 20 8144 9872
>>
>> On Wed, Feb 25, 2015 at 11:01 AM, Ja Sam  wrote:
>>
>>> Hi Roni,
>>> The repair results is following (we run it Friday): Cannot proceed on
>>> repair because a neighbor (/192.168.61.201) is dead: session failed
>>>
>>> But to be honest the neighbor did not died. It seemed to trigger a
>>> series of full GC events on the initiating node. The results form logs
>>> are:
>>>
>>> [2015-02-20 16:47:54,884] Starting repair command #2, repairing 7 ranges
>>> for keyspace prem_maelstrom_2 (parallelism=PARALLEL, full=false)
>>> [2015-02-21 02:21:55,640] Lost notification. You should check server log
>>> for repair status of keyspace prem_maelstrom_2
>>> [2015-02-21 02:22:55,642] Lost notification. You should check server log
>>> for repair status of keyspace prem_maelstrom_2
>>> [2015-02-21 02:23:55,642] Lost notification. You should check server log
>>> for repair status of keyspace prem_maelstrom_2
>>> [2015-02-21 02:24:55,644] Lost notification. You should check server log
>>> for repair status of keyspace prem_maelstrom_2
>>> [2015-02-21 04:41:08,607] Repair session
>>> d5d01dd0-b917-11e4-bc97-e9a66e5b2124 for range
>>> (85070591730234615865843651857942052874,102084710076281535261119195933814292480]
>>> failed with error org.apache.cassandra.exceptions.RepairException: [repair
>>> #d5d01dd0-b917-11e4-bc97-e9a66e5b2124 on prem_maelstrom_2/customer_events,
>>> (85070591730234615865843651857942052874,102084710076281535261119195933814292480]]
>>> Sync failed between /192.168.71.196 and /192.168.61.199
>>> [2015-02-21 04:41:08,608] Repair session
>>> eb8d8d10-b967-11e4-bc97-e9a66e5b2124 for range
>>> (68056473384187696470568107782069813248,85070591730234615865843651857942052874]
>>> failed with error java.io.IOException: Endpoint /192.168.61.199 died
>>> [2015-02-21 04:41:08,608] Repair session
>>> c48aef00-b971-11e4-bc97-e9a66e5b2124 for range (0,10] failed with error
>>> java.io.IOException: Cannot proceed on repair because a neighbor (/
>

Re: Possible problem with disk latency

2015-02-26 Thread Ja Sam
We ran those queries; most of our files are less than 100MB.

Our heap settings are as follows (they are calculated by the script in
cassandra-env.sh):
MAX_HEAP_SIZE="8GB"
HEAP_NEWSIZE="2GB"
which is the maximum recommended by DataStax.

What values do you think we should try?
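
For reference, if we try other values we would override the computed ones
directly in cassandra-env.sh, e.g. (numbers purely illustrative):

   MAX_HEAP_SIZE="12G"
   HEAP_NEWSIZE="3G"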





On Thu, Feb 26, 2015 at 10:06 AM, Roland Etzenhammer <
r.etzenham...@t-online.de> wrote:

> Hi Piotrek,
>
> your disks are mostly idle as far as I can see (the one with 17% busy
> isn't that high on load). One thing came up to my mind did you look on
> the sizes of your sstables? I did this with something like
>
> find /var/lib/cassandra/data -type f -size -1k -name "*Data.db" | wc
> find /var/lib/cassandra/data -type f -size -10k -name "*Data.db" | wc
> find /var/lib/cassandra/data -type f -size -100k -name "*Data.db" | wc
> ...
> find /var/lib/cassandra/data -type f -size -100k -name "*Data.db" | wc
>
> Your count is growing from opscenter - and if there are many really
> small tables I would guess you are running out of heap. If memory
> pressure is high it is likely that there will be much flushes of
> memtables to disk with many small files - had this once. You can
> increase heap in cassandra-env.sh, but be careful.
>
> Best regards,
> Roland
>
>


Re: Possible problem with disk latency

2015-02-26 Thread Ja Sam
Hi Ron,
I looked deeper into my Cassandra files; the SSTables created during the
last day are less than 20MB.

Piotrek

p.s. Your tips are really useful - at least I am starting to find out where
exactly the problem is.

On Thu, Feb 26, 2015 at 3:11 PM, Ja Sam  wrote:

> We did this query, most our files are less than 100MB.
>
> Our heap setting are like (they are calculatwed using scipr in
> cassandra.env):
> MAX_HEAP_SIZE="8GB"
> HEAP_NEWSIZE="2GB"
> which is maximum recommended by DataStax.
>
> What values do you think we should try?
>
>
>
>
>
> On Thu, Feb 26, 2015 at 10:06 AM, Roland Etzenhammer <
> r.etzenham...@t-online.de> wrote:
>
>> Hi Piotrek,
>>
>> your disks are mostly idle as far as I can see (the one with 17% busy
>> isn't that high on load). One thing came up to my mind did you look on
>> the sizes of your sstables? I did this with something like
>>
>> find /var/lib/cassandra/data -type f -size -1k -name "*Data.db" | wc
>> find /var/lib/cassandra/data -type f -size -10k -name "*Data.db" | wc
>> find /var/lib/cassandra/data -type f -size -100k -name "*Data.db" | wc
>> ...
>> find /var/lib/cassandra/data -type f -size -100k -name "*Data.db" | wc
>>
>> Your count is growing from opscenter - and if there are many really
>> small tables I would guess you are running out of heap. If memory
>> pressure is high it is likely that there will be much flushes of
>> memtables to disk with many small files - had this once. You can
>> increase heap in cassandra-env.sh, but be careful.
>>
>> Best regards,
>> Roland
>>
>>
>


Re: Possible problem with disk latency

2015-02-26 Thread Ja Sam
Hi,
I found many similar lines in the log:

INFO  [SlabPoolCleaner] 2015-02-24 12:28:19,557 ColumnFamilyStore.java:850
- Enqueuing flush of customer_events: 95299485 (5%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1465] 2015-02-24 12:28:19,569 Memtable.java:339
- Writing Memtable-customer_events@2011805231(25791893 serialized bytes,
130225 ops, 5%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:1465] 2015-02-24 12:28:20,731 Memtable.java:378
- Completed flushing
/grid/data04/cassandra/data/prem_maelstrom/customer_events-dbf0f26031ff11e4bb64b7d2603bc25a/prem_maelstrom-customer_events-ka-207535-Data.db
(9551014 bytes) for commitlog position
ReplayPosition(segmentId=1424694746404, position=418223)


This means that somehow, when the memtable is around 100 MB (95299485
bytes) - 5% of the on-heap memtable limit - it gets written to disk. During
conversion to an SSTable, compression is applied and the resulting size is
about 10% of the original -> ~10MB (9551014 bytes). Am I right?

The questions of course are:
1) Why are such small memtables flushed to disk if I configured
commitlog_total_space_in_mb to 1024? (The relevant cassandra.yaml knobs are
sketched after the list.)
2) Why are these small SSTables not compacted? Does the compaction process
still pick only the bigger SSTables? Maybe I should change the compaction
strategy?
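
For reference, the knobs I believe control this in 2.1's cassandra.yaml.
My guess is that these flushes come from the memtable pool cleaner (the
SlabPoolCleaner thread) hitting memtable_cleanup_threshold across all
memtables, rather than from commitlog_total_space_in_mb. The values below
are only the ones I plan to review, not recommendations:

   commitlog_total_space_in_mb: 1024
   # on-heap memtable space; defaults to 1/4 of the heap when left commented out
   # memtable_heap_space_in_mb: 2048
   # flush the largest memtable when total memtable usage exceeds this fraction;
   # defaults to 1 / (memtable_flush_writers + 1)
   # memtable_cleanup_threshold: 0.11
   # memtable_flush_writers: 8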


On Thu, Feb 26, 2015 at 5:24 PM, Roland Etzenhammer <
r.etzenham...@t-online.de> wrote:

> Hi,
>
> 8GB Heap is a good value already - going above 8GB will often result in
> noticeable gc pause times in java, but you can give 12G a try just to see
> if that helps (and turn it back down again). You can add a "Heap Used"
> graph in opscenter to get a quick overview of your heap state.
>
> Best regards,
> Roland
>
>
>