Re: Timeout for only one keyspace in cluster

2018-07-21 Thread Ben Slater
Note that that writetimeout exception can be C*'s way of telling you that
there is contention on an LWT (rather than an actual timeout). See
https://issues.apache.org/jira/browse/CASSANDRA-9328
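
A rough sketch of telling the two cases apart from a gocql client
(assuming a gocql version that exposes RequestErrWriteTimeout and a Go
version with errors.As; as I recall, the server marks conditional writes
with WriteType "CAS"):

package main

import (
	"errors"
	"fmt"

	"github.com/gocql/gocql"
)

// isCASContention reports whether err is a server-side write timeout raised
// on a conditional (LWT) write. Per CASSANDRA-9328 this usually means
// contention on the Paxos round rather than a genuinely slow replica, so
// retrying the conditional statement is the usual response.
func isCASContention(err error) bool {
	var wt *gocql.RequestErrWriteTimeout
	return errors.As(err, &wt) && wt.WriteType == "CAS"
}

func main() {
	var err error // placeholder: would come from executing a LWT via gocql
	fmt.Println(isCASContention(err))
}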

Cheers
Ben

On Sun, 22 Jul 2018 at 11:20 Goutham reddy 
wrote:

> Hi,
> Since the table has a single partition key, try updating with only the
> partition key instead of passing the other columns, and try setting the
> consistency level to ONE.
>
> Cheers,
> Goutham.
>
> On Fri, Jul 20, 2018 at 6:57 AM learner dba 
> wrote:
>
>> Anybody has any ideas about this? This is happening in production and we
>> really need to fix it.
>>
>> On Thursday, July 19, 2018, 10:41:59 AM CDT, learner dba
>>  wrote:
>>
>>
>> Our foreignid is a unique identifier, and we did check for wide
>> partitions; cfhistograms shows all partitions are evenly sized:
>>
>> Percentile  SSTables   Write Latency  Read Latency  Partition Size  Cell Count
>>                        (micros)       (micros)      (bytes)
>>
>> 50%         0.00       29.52          0.00          1916            12
>> 75%         0.00       42.51          0.00          2299            12
>> 95%         0.00       61.21          0.00          2759            14
>> 98%         0.00       73.46          0.00          2759            17
>> 99%         0.00       88.15          0.00          2759            17
>> Min         0.00       9.89           0.00          150             2
>> Max         0.00       88.15          0.00          7007506         42510
>> Anything else that we can check?
>>
>> On Wednesday, July 18, 2018, 10:44:29 PM CDT, wxn...@zjqunshuo.com <
>> wxn...@zjqunshuo.com> wrote:
>>
>>
>> Your partition key is foreignid. You may have a large partition. Why not
>> use foreignid+timebucket as partition key?
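
That suggestion would change only the PRIMARY KEY line of the schema
below; a sketch of the revised definition (same hypothetical table):

PRIMARY KEY ((foreignid, timebucket), key, timevalue)

With this, (foreignid, timebucket) becomes a composite partition key, so
each time bucket gets its own bounded partition, while key and timevalue
remain clustering columns.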
>>
>>
>> *From:* learner dba 
>> *Date:* 2018-07-19 01:48
>> *To:* User cassandra.apache.org 
>> *Subject:* Timeout for only one keyspace in cluster
>> Hi,
>>
>> We have a cluster with multiple keyspaces. All queries are performing
>> well, but write operations on a few tables in one specific keyspace get
>> write timeouts. The table has a counter column, and the counter update
>> query always times out. Any idea?
>>
>> CREATE TABLE x.y (
>>
>> foreignid uuid,
>>
>> timebucket text,
>>
>> key text,
>>
>> timevalue int,
>>
>> value counter,
>>
>> PRIMARY KEY (foreignid, timebucket, key, timevalue)
>>
>> ) WITH CLUSTERING ORDER BY (timebucket ASC, key ASC, timevalue ASC)
>>
>> AND bloom_filter_fp_chance = 0.01
>>
>> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>
>> AND comment = ''
>>
>> AND compaction = {'class':
>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>> 'max_threshold': '32', 'min_threshold': '4'}
>>
>> AND compression = {'chunk_length_in_kb': '64', 'class':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>
>> AND crc_check_chance = 1.0
>>
>> AND dclocal_read_repair_chance = 0.1
>>
>> AND default_time_to_live = 0
>>
>> AND gc_grace_seconds = 864000
>>
>> AND max_index_interval = 2048
>>
>> AND memtable_flush_period_in_ms = 0
>>
>> AND min_index_interval = 128
>>
>> AND read_repair_chance = 0.0
>>
>> AND speculative_retry = '99PERCENTILE';
>>
>> Query and Error:
>>
>> UPDATE x.y SET value = value + 1 WHERE foreignid = ? AND timebucket = ?
>> AND key = ? AND timevalue = ?
>>
>> err = gocql: no response received from cassandra within timeout period
>>
>>
>> I verified CL=LOCAL_SERIAL.
>>
>> We have been working on this issue for many days; any help would be much
>> appreciated.
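
For reference, a minimal gocql sketch of the failing update with an
explicit consistency level and a generous client-side timeout (the
contact point, keyspace, and bind values are placeholders; note that
counter updates cannot be conditional, so a serial CL such as
LOCAL_SERIAL does not apply to them, and a regular write CL like ONE or
LOCAL_ONE, as suggested above, is what counts here):

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.0.0.1") // placeholder contact point
	cluster.Keyspace = "x"
	cluster.Consistency = gocql.LocalOne // regular write CL; serial CLs apply only to LWTs
	cluster.Timeout = 10 * time.Second   // generous client-side timeout while diagnosing

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Counter update with the full primary key bound, as in the failing query.
	if err := session.Query(
		`UPDATE y SET value = value + 1
		 WHERE foreignid = ? AND timebucket = ? AND key = ? AND timevalue = ?`,
		gocql.TimeUUID(), "2018-07-19", "k1", 1).Exec(); err != nil {
		log.Println("counter update failed:", err)
	}
}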
>>
>>
>>
>> --
> Regards
> Goutham Reddy
>
-- 


*Ben Slater*

*Chief Product Officer*


Re: which driver to use with cassandra 3

2018-07-21 Thread Goutham reddy
Hi,
Consider overriding the default Java driver provided by Spring Boot if you
are using DataStax clusters with any of the 3.x DataStax drivers. I agree
with Patrick: always have one keyspace per application; that way you get
domain-driven applications and less overhead from switching between
keyspaces.

Cheers,
Goutham

On Fri, Jul 20, 2018 at 10:10 AM Patrick McFadin  wrote:

> Vitaliy,
>
> The DataStax Java driver is very actively maintained by a good size team
> and a lot of great community contributors. It's version 3.x compatible and
> even has some 4.x features starting to creep in. Support for virtual tables
> (https://issues.apache.org/jira/browse/CASSANDRA-7622)  was just merged
> as an example. Even the largest DataStax customers have a mix of enterprise
> + OSS and we want to support them either way. Giving developers the most
> consistent experience is part of that goal.
>
> As for spring-data-cassandra, it does pull the latest driver as a part of
> its own build, so you will already have it in your classpath. Spring adds
> some auto-magic that you should be aware of. The part you mentioned about
> schema management is one to be careful with. If you use it in dev, it's
> not a huge problem. If it gets out to prod, you could potentially have A
> LOT of concurrent schema changes happening, which can lead to bad things.
> Also, some of the Spring API features such as findAll() can expose typical
> C* anti-patterns such as "allow filtering". Just be aware of which feature
> does what. And finally, another potential production problem is that if
> you use a lot of keyspaces, Spring will instantiate a new driver Session
> object per keyspace, which can lead to a lot of redundant connections to
> the database. From the driver, a better way is to specify a keyspace per
> query.
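
That last point generalizes beyond Spring; a hedged gocql sketch of one
shared Session with the keyspace qualified per query rather than bound
to the Session (the host, keyspace, and table names are hypothetical):

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// One cluster/session for the whole application; note no Keyspace is set.
	cluster := gocql.NewCluster("10.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Qualify the keyspace in each statement instead of binding it to the
	// Session, avoiding one connection pool per keyspace.
	var id gocql.UUID
	iter := session.Query(`SELECT id FROM ks_a.users LIMIT 10`).Iter()
	for iter.Scan(&id) {
		fmt.Println(id)
	}
	if err := iter.Close(); err != nil {
		log.Println("query failed:", err)
	}
}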
>
> As you are using spring-data-cassandra, please share your experiences if
> you can. There are a lot of developers that would benefit from some
> real-world stories.
>
> Patrick
>
>
> On Fri, Jul 20, 2018 at 4:54 AM Vitaliy Semochkin 
> wrote:
>
>> Thank you very much Duy Hai Doan!
>> I have relatively simple demands, and since Spring uses the DataStax
>> driver I can always fall back to it, though I would prefer to use
>> Spring to do the bootstrapping and resource management for me.
>> On Fri, Jul 20, 2018 at 4:51 PM DuyHai Doan  wrote:
>> >
>> > Spring Data Cassandra is so-so... It has fewer features (at least at
>> the time I looked at it) than the default Java driver
>> >
>> > For driver, right now most of people are using Datastax's ones
>> >
>> > On Fri, Jul 20, 2018 at 3:36 PM, Vitaliy Semochkin <
>> vitaliy...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Which driver to use with cassandra 3
>> >>
>> >> the one that is provided by datastax, netflix or something else.
>> >>
>> >> Spring uses the driver from DataStax, though is it a reliable
>> >> solution for a long-term project, bearing in mind that DataStax and
>> >> Cassandra parted ways?
>> >>
>> >> Regards,
>> >> Vitaliy
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> >> For additional commands, e-mail: user-h...@cassandra.apache.org
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
Regards
Goutham Reddy


Stumped By Cassandra delays

2018-07-21 Thread Gareth Collins
Hello,

We are running Cassandra 2.1.14 in AWS, with c5.4xlarge machines
(initially these were m4.xlarge) for our Cassandra servers and
m4.xlarge for our application servers. On one of the clusters having
problems we have 6 C* nodes and 6 AS nodes (two C*/AS nodes in each
availability zone).

In the deployed application it seems to be a common use case to do one
of the following, and these use cases are having periodic errors:
(1) Copy one Cassandra table to another table using the application server.
(2) Export from a Cassandra table to file using the application server.

The application server is reading from the table via token range, the
token range queries being calculated to ensure the whole token range
for a query falls on the same node, i.e. the query looks like this:

select * from  where token(key) > ? and token(key) <= ?

This was probably initially done on the assumption that the driver
would be able to figure out which nodes contained the data. As we
realize now, the driver only supports routing to the right node if the
partition key is defined in the WHERE clause.
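
To make that pattern concrete, here is a hedged gocql sketch of the
token-range scan described above (assuming the Murmur3Partitioner's
token bounds; the contact point, keyspace, table, and column names are
placeholders):

package main

import (
	"fmt"
	"math"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.0.0.1") // placeholder contact point
	cluster.Keyspace = "ks"                 // placeholder keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Murmur3Partitioner tokens span [-2^63, 2^63 - 1]; walk the ring in
	// fixed-width slices so each query covers one contiguous token range.
	// (The single token equal to math.MinInt64 is skipped for brevity.)
	const slices = 256
	width := uint64(math.MaxUint64) / slices
	start := int64(math.MinInt64)
	for i := 0; i < slices; i++ {
		end := start + int64(width)
		if i == slices-1 {
			end = math.MaxInt64 // last slice closes out the ring
		}
		iter := session.Query(
			`SELECT id FROM tbl WHERE token(id) > ? AND token(id) <= ?`,
			start, end).Iter()
		var id string
		for iter.Scan(&id) {
			fmt.Println(id)
		}
		if err := iter.Close(); err != nil {
			fmt.Println("range scan failed:", err)
		}
		start = end
	}
}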

When we do the read we are doing a lot of queries in parallel to
maximize performance. I believe when the copy is being run there are
currently 5 threads per machine doing the copy for a max of 30
concurrent read requests across the cluster.

Specifically, these tasks have periodically been hitting a few of these
errors:

INFO  [ScheduledTasks:1] 2018-07-13 20:03:20,124
MessagingService.java:929 - REQUEST_RESPONSE messages were dropped in
last 5000 ms: 1 for internal timeout and 0 for cross node timeout

These dropped messages cause errors in the read-by-token-range queries.

Running "nodetool settraceprobability 1" and re-running the failing
test, we could see that this timeout would occur when a coordinator
was used on the read query (i.e. the coordinator sent the message but
didn't get a response to the query from the other node within the time
limit). We saw these timeouts periodically even when we set the
timeouts to 60 seconds.

As I mentioned at the beginning, we had initially been using m4.xlarge
for our Cassandra servers. After discussion with AWS it was suggested
that we could be hitting performance limits (i.e. either network or
disk; I believe more likely network, as I didn't see the disk getting
hit very hard), so we upgraded the Cassandra servers and everything was
fine for a while.

But then the problems started to recur recently, pretty consistently
failing on these copy or export jobs running overnight.
Having looked at resource usage statistics graphs it appeared that the
C* servers were not heavily loaded at all (the app servers were being
maxed out) and I did not see any significant garbage collections in
the logs that could explain the delays.

As a last resort I decided to turn up the logging on the server and
client: the DataStax client was set to debug, and the server was set
to the following levels via nodetool. The goal was to maximize logging
while cutting out the very verbose stuff (e.g. Message.java appears to
print out the whole message in 2.1.14 when put into debug; it looks
like that was moved to trace in a later 2.1.x release):
bin/nodetool setlogginglevel org.apache.cassandra.tracing.Tracing INFO
bin/nodetool setlogginglevel org.apache.cassandra.transport.Message INFO
bin/nodetool setlogginglevel org.apache.cassandra.db.ColumnFamilyStore DEBUG
bin/nodetool setlogginglevel org.apache.cassandra.gms.Gossiper DEBUG
bin/nodetool setlogginglevel
org.apache.cassandra.db.filter.SliceQueryFilter DEBUG
bin/nodetool setlogginglevel
org.apache.cassandra.service.pager.AbstractQueryPager INFO
bin/nodetool setlogginglevel org.apache.cassandra TRACE

Of course, when we did this (as part of turning on the logging the
application servers were restarted), the problematic export-to-file
jobs which had failed every time for the last week succeeded and ran
much faster than usual (47 minutes vs. 1 1/2 hours). So I decided to
look for the biggest delay (which turned out to be ~9 seconds) and see
what I could find in the log; outside of this window, response times
were up to perhaps 20ms. Here is what I found:

(1) Only one Cassandra node had delays at a time.

(2) On the Cassandra node that did have delays there was no
significant information from the GCInspector (the system stopped
processing client requests between 05:32:33 and 05:32:43). If
anything, it confirmed my belief that the system was lightly loaded:

DEBUG [Service Thread] 2018-07-20 05:32:25,559 GCInspector.java:260 -
ParNew GC in 8ms.  CMS Old Gen: 2879689192 -> 2879724544; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 8701480 -> 12462936
DEBUG [Service Thread] 2018-07-20 05:32:26,792 GCInspector.java:260 -
ParNew GC in 9ms.  CMS Old Gen: 2879724544 -> 2879739504; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 12462936 -> 9463872
DEBUG [Service Thread] 2018-07-20 05:32:29,227 GCInspector.java:260 -
ParNew GC in 9ms.  CMS Old Gen: 2879739504 -> 2879755872; Par Eden