Re: How to install an older minor release?

2019-04-03 Thread Kyrylo Lebediev
Hi Oleksandr,
Yes, that was always the case. All older versions are removed from Debian repo 
index :(

From: Oleksandr Shulgin 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, April 2, 2019 at 20:04
To: User 
Subject: How to install an older minor release?

Hello,

We've just noticed that we cannot install older minor releases of Apache 
Cassandra from Debian packages, as described on this page: 
http://cassandra.apache.org/download/

Previously we were doing the following at the last step: apt-get install 
cassandra==3.0.17

Today it fails with error:
E: Version '3.0.17' for 'cassandra' was not found

And `apt-get show cassandra` reports only one version available, the latest 
released one: 3.0.18
The packages for the older versions are still in the pool: 
http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/

Was it always the case that only the latest version is available to be 
installed directly with apt or did something change recently?

Regards,
--
Alex



Re: New node insertion methods

2019-03-12 Thread Kyrylo Lebediev
Hi Vsevolod,

> Are there any workarounds to speed up the process? (e.g. doing cleanup only 
> after all 4 new nodes joined cluster), or inserting multiple nodes 
> simultaneously with specific settings?
Doing cleanup only after all 4 new nodes have joined the cluster is allowed.
Inserting multiple nodes simultaneously with specific settings is generally possible, but not 
recommended. Add/remove nodes one by one if needed. See 
http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html for details.
Also, you may tune the stream_throughput_outbound_megabits_per_sec parameter in 
cassandra.yaml.
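
For instance, the same throttle can be checked and adjusted at runtime with nodetool, without a 
restart (the value is in megabits per second and is only illustrative):

  # check the current cap, then raise it while the new nodes are bootstrapping
  nodetool getstreamthroughput
  nodetool setstreamthroughput 400
  # once bootstrap finishes, set it back to the cassandra.yaml value (default is 200)
  nodetool setstreamthroughput 200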

> How do people with tens of Cassandra nodes perform insertion/extraction of 
> new/failed nodes? What's the usual routine in case you have 20+ nodes, and 
> need to decommission 4 nodes and insert 4 new ones instead?
It's better to replace an existing node without streaming, just copying the data via rsync / 
attaching the EBS volume to the new server, etc.: 
http://thelastpickle.com/blog/2018/02/21/replace-node-without-bootstrapping.html

Regards,
Kyrill


From: Vsevolod Filaretov 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, March 12, 2019 at 11:01
To: "user@cassandra.apache.org" 
Subject: New node insertion methods

Hello everyone!

We have a cluster of 4 nodes, 4.5 tb/data per node, and are in the middle of 
adding 4 more nodes to the cluster.
Joining a new node based on the official guidelines/help pages (set up Cassandra on a new 
node, start the Cassandra instance, wait until the node goes from the JOINING state to 
NORMAL, perform nodetool cleanup) takes approximately 7 days in our case, per 
single node.

Questions:
1)
Are there any workarounds to speed up the process? (e.g. doing cleanup only 
after all 4 new nodes joined cluster), or inserting multiple nodes 
simultaneously with specific settings?
2)
How do people with tens of Cassandra nodes perform insertion/extraction of 
new/failed nodes? What's the usual routine in case you have 20+ nodes, and need 
to decommission 4 nodes and insert 4 new ones instead?
Links to blog posts or mail threads much appreciated!

Thank you all in advance,
Vsevolod Filaretov.


Re: Maximum memory usage reached

2019-03-07 Thread Kyrylo Lebediev
Got it.
Thank you for helping me, Jon, Jeff!

> Is there a reason why you’re picking Cassandra for this dataset?
The decision wasn't made by me; I guess C* was chosen because some huge growth 
was planned.

Regards,
Kyrill

From: Jeff Jirsa 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, March 6, 2019 at 22:19
To: "user@cassandra.apache.org" 
Subject: Re: Maximum memory usage reached

Also, that particular logger is for the internal chunk / page cache. If it 
can’t allocate from within that pool, it’ll just use a normal bytebuffer.

It’s not really a problem, but if you see performance suffer, upgrade to latest 
3.11.4, there was a bit of a perf improvement in the case where that cache 
fills up.
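
For reference, the 512MiB cap in that log line is file_cache_size_in_mb (it defaults to the 
smaller of 512MiB and 1/4 of the heap). A minimal cassandra.yaml excerpt, assuming you 
actually want to change the cap; the value is illustrative and has to fit the node's memory:

  # cap for the chunk / page cache that the NoSpamLogger message refers to
  file_cache_size_in_mb: 512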

--
Jeff Jirsa


On Mar 6, 2019, at 11:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
That’s not an error. To the left of the log message is the severity, level INFO.

Generally, I don’t recommend running Cassandra on only 2GB ram or for small 
datasets that can easily fit in memory. Is there a reason why you’re picking 
Cassandra for this dataset?

On Thu, Mar 7, 2019 at 8:04 AM Kyrylo Lebediev <klebed...@conductor.com> wrote:
Hi All,

We have a tiny 3-node cluster
C* version 3.9 (I know 3.11 is better/stable, but can’t upgrade immediately)
HEAP_SIZE is 2G
JVM options are default
All settings in cassandra.yaml are default (file_cache_size_in_mb is not set)

Data per node – just ~1 GByte

We're getting the following error messages:

DEBUG [CompactionExecutor:87412] 2019-03-06 11:00:13,545 
CompactionTask.java:150 - Compacting (ed4a4d90-4028-11e9-adc0-230e0d6622df) 
[/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23248-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23247-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23246-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23245-big-Data.db:level=0,
 ]
DEBUG [CompactionExecutor:87412] 2019-03-06 11:00:13,582 
CompactionTask.java:230 - Compacted (ed4a4d90-4028-11e9-adc0-230e0d6622df) 4 
sstables to 
[/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23249-big,]
 to level=0.  6.264KiB to 1.485KiB (~23% of original) in 36ms.  Read Throughput 
= 170.754KiB/s, Write Throughput = 40.492KiB/s, Row Throughput = ~106/s.  194 
total partitions merged to 44.  Partition merge counts were {1:18, 4:44, }
INFO  [IndexSummaryManager:1] 2019-03-06 11:00:22,007 
IndexSummaryRedistribution.java:75 - Redistributing index summaries
INFO  [pool-1-thread-1] 2019-03-06 11:11:24,903 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:26:24,926 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:41:25,010 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:56:25,018 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB

What's interesting is that the "Maximum memory usage reached" message appears every 15 
minutes.
A reboot temporarily solves the issue, but the message appears again after some time.

Checked: there are no huge partitions (max partition size is ~2 MBytes).

How can such a small amount of data cause this issue?
How can we debug this issue further?


Regards,
Kyrill


--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Maximum memory usage reached

2019-03-06 Thread Kyrylo Lebediev
Hi All,

We have a tiny 3-node cluster
C* version 3.9 (I know 3.11 is better/stable, but can’t upgrade immediately)
HEAP_SIZE is 2G
JVM options are default
All settings in cassandra.yaml are default (file_cache_size_in_mb is not set)

Data per node – just ~1 GByte

We're getting the following error messages:

DEBUG [CompactionExecutor:87412] 2019-03-06 11:00:13,545 
CompactionTask.java:150 - Compacting (ed4a4d90-4028-11e9-adc0-230e0d6622df) 
[/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23248-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23247-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23246-big-Data.db:level=0,
 
/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23245-big-Data.db:level=0,
 ]
DEBUG [CompactionExecutor:87412] 2019-03-06 11:00:13,582 
CompactionTask.java:230 - Compacted (ed4a4d90-4028-11e9-adc0-230e0d6622df) 4 
sstables to 
[/cassandra/data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/mc-23249-big,]
 to level=0.  6.264KiB to 1.485KiB (~23% of original) in 36ms.  Read Throughput 
= 170.754KiB/s, Write Throughput = 40.492KiB/s, Row Throughput = ~106/s.  194 
total partitions merged to 44.  Partition merge counts were {1:18, 4:44, }
INFO  [IndexSummaryManager:1] 2019-03-06 11:00:22,007 
IndexSummaryRedistribution.java:75 - Redistributing index summaries
INFO  [pool-1-thread-1] 2019-03-06 11:11:24,903 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:26:24,926 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:41:25,010 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [pool-1-thread-1] 2019-03-06 11:56:25,018 NoSpamLogger.java:91 - Maximum 
memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB

What's interesting is that the "Maximum memory usage reached" message appears every 15 
minutes.
A reboot temporarily solves the issue, but the message appears again after some time.

Checked: there are no huge partitions (max partition size is ~2 MBytes).

How can such a small amount of data cause this issue?
How can we debug this issue further?


Regards,
Kyrill




RE: Java 11 support in Cassandra 4.0 + Early Testing and Feedback

2018-09-07 Thread Kyrylo Lebediev
As many people use the Oracle JDK, I think it's worth mentioning that according to the 
Oracle Support Roadmap there are some changes in their policies for Java 11 and 
above (http://www.oracle.com/technetwork/java/eol-135779.html).

In particular:
“Starting with Java SE 9, in addition to providing Oracle JDK for free under the BCL, Oracle 
also started providing builds of OpenJDK under an open source license (similar to that of 
Linux). Oracle is working to make the Oracle JDK and OpenJDK builds from Oracle 
interchangeable - targeting developers and organizations that do not want commercial support 
or enterprise management tools. Beginning with Oracle Java SE 11 (18.9 LTS), 
the Oracle JDK will continue to be available royalty-free for development, 
testing, prototyping or demonstrating purposes. As announced in September 2017, 
with the OracleJDK and builds of Oracle OpenJDK being interchangeable for 
releases of Java SE 11 and later, the Oracle JDK will primarily be for 
commercial and support customers and OpenJDK builds from Oracle are for those 
who do not want commercial support or enterprise management tools.”

What do these statements mean for Cassandra users in terms of which JDK (there are 
several OSS alternatives available) is best to use in the absence of an 
active Oracle Java SE Subscription?

Regards,
Kyrill

From: Jonathan Haddad 
Sent: Thursday, August 16, 2018 9:02 PM
To: user 
Subject: Java 11 support in Cassandra 4.0 + Early Testing and Feedback

Hey folks,

As we start to get ready to feature freeze trunk for 4.0, it's going to be 
important to get a lot of community feedback.  This is going to be a big 
release for a number of reasons.

* Virtual tables.  Finally a nice way of querying for system metrics & status
* Streaming optimizations 
(https://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html)
* Groundwork for strongly consistent schema changes
* Optimizations to internode communcation
* Experimental support for Java 11

I (and many others) would like Cassandra to be rock solid on day one if its 
release.  The best way to ensure that happens is if people provide feedback.  
One of the areas we're going to need a lot of feedback on is on how things work 
with Java 11, especially if you have a way of simulating a real world workload 
on a staging cluster.  I've written up instructions here on how to start 
testing: http://thelastpickle.com/blog/2018/08/16/java11.html

Java 11 hasn't been released yet, but that doesn't mean it's not a good time to 
test.  Any bugs we can identify now will help us get to a stable release 
faster.  If you rely on Cassandra for your business, please take some time to 
participate in the spirit of OSS by helping test & provide feedback to the team.

Thanks everyone!
---
Jon Haddad
Principal Consultant, The Last Pickle


RE: Re:RE: data not beening syncd in different nodes

2018-08-31 Thread Kyrylo Lebediev
A TTL of 60 seconds is a small value (even smaller than the compaction window). This means 
that even if all replicas are consistent, data is deleted really quickly, so 
results may differ even for 2 consecutive queries. How about this theory?
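
For instance, you can watch this from cqlsh on your table (the vin value below is just an 
illustrative test row):

  cassandra@cqlsh> INSERT INTO nev_prod_tsp.heartbeat (vin, pkg_sn) VALUES ('TEST-VIN', '1');
  cassandra@cqlsh> SELECT vin, TTL(pkg_sn) FROM nev_prod_tsp.heartbeat WHERE vin = 'TEST-VIN';

TTL(pkg_sn) starts around 60 (the table's default_time_to_live) and the row disappears once 
it reaches 0.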

The CL in your driver depends on which CL is the default for your particular driver.

Regards,
Kyrill

From: David Ni 
Sent: Friday, August 31, 2018 12:53 PM
To: user@cassandra.apache.org
Subject: Re:RE: data not beening syncd in different nodes

Hi Kyrylo
I have already tried consistency QUORUM and ALL, still the same result.
The Java code that writes data to Cassandra does not set a CL; does this mean the 
default CL is ONE?
The tpstats output is below; there are some dropped mutations, but the count doesn't 
grow over a very long time:
Pool Name                      Active   Pending    Completed   Blocked   All time blocked
MutationStage                       0         0    300637997         0                  0
ViewMutationStage                   0         0            0         0                  0
ReadStage                           0         0      4357929         0                  0
RequestResponseStage                0         0    306954791         0                  0
ReadRepairStage                     0         0       472027         0                  0
CounterMutationStage                0         0            0         0                  0
MiscStage                           0         0            0         0                  0
CompactionExecutor                  0         0     17976139         0                  0
MemtableReclaimMemory               0         0        53018         0                  0
PendingRangeCalculator              0         0           11         0                  0
GossipStage                         0         0     59889799         0                  0
SecondaryIndexManagement            0         0            0         0                  0
HintsDispatcher                     0         0            7         0                  0
MigrationStage                      0         0          101         0                  0
MemtablePostFlush                   0         0        41470         0                  0
PerDiskMemtableFlushWriter_0        0         0        52779         0                  0
ValidationExecutor                  0         0           80         0                  0
Sampler                             0         0            0         0                  0
MemtableFlushWriter                 0         0        40301         0                  0
InternalResponseStage               0         0           70         0                  0
AntiEntropyStage                    0         0          352         0                  0
CacheCleanupExecutor                0         0            0         0                  0
Native-Transport-Requests           0         0    158242159         0              13412

Message type          Dropped
READ                        0
RANGE_SLICE                 0
_TRACE                      1
HINT                        0
MUTATION                   34
COUNTER_MUTATION            0
BATCH_STORE                 0
BATCH_REMOVE                0
REQUEST_RESPONSE            0
PAGED_RANGE                 0
READ_REPAIR                 0

And yes, we are inserting data with TTL=60 seconds.
We have 200 vehicles and update this table every 5 or 10 seconds.


At 2018-08-31 17:10:50, "Kyrylo Lebediev" <kyrylo_lebed...@epam.com.INVALID> wrote:

Looks like you're querying the table at CL = ONE which is default for cqlsh.
If you run cqlsh on nodeX it doesn't mean you retrieve data from this node. 
What this means is that nodeX will be coordinator, whereas actual data will be 
retrieved from any node, based on token range + dynamic snitch data (which, I 
assume, you use as it's turned on by default).
Which CL do you use when you write data?
Try querying using CL = QUORUM or ALL. What's your result in this case?
If you run 'nodetool tpstats' across all the nodes, are there dropped mutations?


As you use TimeWindowCompactionStrategy, do you insert data with TTL?
These buckets seem too small to me: 'compaction_window_size': '2', 
'compaction_window_unit': 'MINUTES'.
Do you have such a huge amount of writes that such a bucket size makes sense?

Regards,
Kyrill

From: David Ni <zn1...@126.com>
Sent: Friday, August 31, 2018 11:39 AM
To: user@cassandra.apache.org
Subject: data not beening syncd in different nodes

Hi Experts,
    I am using Cassandra 3.9 in a production environment. We have 6 nodes and the RF 
of the keyspace is 3. I have a table with the definition below:
 CREATE TABLE nev_prod_tsp.heartbeat (
vin text PRIMARY KEY,
create_time timestamp,
pkg_sn text,
prot_ver text,
trace_time timestamp,
tsp_sn text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}

RE: data not beening syncd in different nodes

2018-08-31 Thread Kyrylo Lebediev
Looks like you're querying the table at CL = ONE which is default for cqlsh.
If you run cqlsh on nodeX it doesn't mean you retrieve data from this node. 
What this means is that nodeX will be coordinator, whereas actual data will be 
retrieved from any node, based on token range + dynamic snitch data (which, I 
assume, you use as it's turned on by default).
Which CL do you use when you write data?
Try querying using CL = QUORUM or ALL. What's your result in this case?
If you run 'nodetool tpstats' across all the nodes, are there dropped mutations?
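
For the CL check, from cqlsh it is something like this (cqlsh defaults to ONE; the 
CONSISTENCY setting stays for the rest of the session):

  cassandra@cqlsh> CONSISTENCY QUORUM;
  cassandra@cqlsh> SELECT count(*) FROM nev_prod_tsp.heartbeat;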


As you use TimeWindowCompactionStrategy, do you insert data with TTL?
These buckets seem too small to me: 'compaction_window_size': '2', 
'compaction_window_unit': 'MINUTES'.
Do you have such a huge amount of writes that such a bucket size makes sense?

Regards,
Kyrill

From: David Ni 
Sent: Friday, August 31, 2018 11:39 AM
To: user@cassandra.apache.org
Subject: data not beening syncd in different nodes

Hi Experts,
    I am using Cassandra 3.9 in a production environment. We have 6 nodes and the RF 
of the keyspace is 3. I have a table with the definition below:
 CREATE TABLE nev_prod_tsp.heartbeat (
vin text PRIMARY KEY,
create_time timestamp,
pkg_sn text,
prot_ver text,
trace_time timestamp,
tsp_sn text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = 'heartbeat'
AND compaction = {'class': 
'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
'compaction_window_size': '2', 'compaction_window_unit': 'MINUTES', 
'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 60
AND gc_grace_seconds = 120
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.2
AND speculative_retry = '99PERCENTILE';
I am updating this table's data every 5 or 10 seconds; a data sample is below:
cassandra@cqlsh> select * from nev_prod_tsp.heartbeat;

 vin   | create_time | pkg_sn | prot_ver | 
trace_time  | tsp_sn
---+-++--+-+--
 LLXA2A408JA002371 | 2018-08-31 08:16:07.348000+ | 128209 |2 | 
2018-08-31 08:16:09.00+ | ad4c2b13-d894-4804-9cf2-e07a3c5851bd
 LLXA2A400JA002333 | 2018-08-31 08:16:06.386000+ | 142944 |2 | 
2018-08-31 08:16:04.00+ | 6ba9655c-8542-4251-ba9b-93ae7420ecc7
 LLXA2A402JA002351 | 2018-08-31 08:16:09.451000+ | 196040 |2 | 
2018-08-31 08:16:07.00+ | 9b6a5d7d-4917-46bc-a247-8a0ff8c1e699


but when I select all data from this table, the results from different nodes are 
not the same:
node1: about 60 rows
node2:about 140 rows
node3:about 30 rows
node4:about 140 rows
node5:about 70 rows
node6:about 140 rows

I have tried compact and repair on all nodes, but it didn't help.

Could anyone give me some help? Thanks very much.





Re: Extending Cassandra on AWS from single Region to Multi-Region

2018-08-13 Thread Kyrylo Lebediev
Hi,


There is an instruction how to switch to a different snitch in datastax docs: 
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsSwitchSnitch.html

Personally I haven't tried changing Ec2Snitch to Ec2MultiRegionSnitch, but my understanding is 
that as long as the 'dc' and 'rack' values are left unchanged for all nodes after 
changing the snitch, there shouldn't be any issues. But it's just my guess; you have 
to test this procedure in a non-prod environment before making the change in prod.


Also, consider using GossipingPropertyFileSnitch: it requires some 
additional configuration in AWS (editing the cassandra-rackdc.properties file) 
compared to the native AWS snitches, but gives more flexibility. For example:
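
A minimal cassandra-rackdc.properties (values are illustrative; on AWS the dc is usually 
the region and the rack the availability zone):

  dc=us-east
  rack=1a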


Regards,

Kyrill


From: srinivasarao daruna 
Sent: Friday, August 10, 2018 5:14:30 PM
To: user@cassandra.apache.org
Subject: Re: Extending Cassandra on AWS from single Region to Multi-Region

Hey All,

Any info on this topic.?

Thank You,
Regards,
Srini

On Wed, Aug 8, 2018, 9:46 PM srinivasarao daruna <sree.srin...@gmail.com> wrote:
Hi All,

We have built Cassandra on AWS EC2 instances. Initially when creating cluster 
we have not considered multi-region deployment and we have used AWS EC2Snitch.

We have used EBS Volumes to save our data and each of those disks were filled 
around 350G.
We want to extend it to Multi Region and wanted to know the better approach and 
recommendations to achieve this process.

I agree that we made a mistake by not using EC2MultiRegionSnitch, but it's in the 
past now, and if anyone has faced or implemented a similar thing I would like to get 
some guidance.

Any help would be very much appreciated.

Thank You,
Regards,
Srini


Re: Compression Tuning Tutorial

2018-08-13 Thread Kyrylo Lebediev
Thank you, Jon!


From: Jonathan Haddad 
Sent: Thursday, August 9, 2018 7:29:24 PM
To: user
Subject: Re: Compression Tuning Tutorial

There's a discussion about direct I/O here you might find interesting: 
https://issues.apache.org/jira/browse/CASSANDRA-14466

I suspect the main reason is that O_DIRECT wasn't added till Java 10, and while 
it could be used with some workarounds, there's a lot of entropy around 
changing something like this.  It's not a trivial task to do it right, and 
mixing has some really nasty issues.

At least it means there's lots of room for improvement though :)


On Thu, Aug 9, 2018 at 5:36 AM Kyrylo Lebediev 
 wrote:

Thank you Jon, great article as usual!


One topic that was discussed in the article is filesystem cache which is 
traditionally leveraged for data caching in Cassandra (with row-caching 
disabled by default).

IIRC mmap() is used.

Some RDBMSs and NoSQL DBs use direct I/O + async I/O and maintain their own, 
not kernel-managed, DB cache, thus improving overall performance.

As Cassandra is designed to be a DB with low response time, this approach with 
DIO/AIO/DB Cache seems to be a really useful feature.

Just out of curiosity, are there reasons why this advanced I/O stack wasn't 
implemented, other than a lack of resources to do it?


Regards,

Kyrill


From: Eric Plowe <eric.pl...@gmail.com>
Sent: Wednesday, August 8, 2018 9:39:44 PM
To: user@cassandra.apache.org
Subject: Re: Compression Tuning Tutorial

Great post, Jonathan! Thank you very much.

~Eric

On Wed, Aug 8, 2018 at 2:34 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
Hey folks,

We've noticed a lot over the years that people create tables usually leaving 
the default compression parameters, and have spent a lot of time helping teams 
figure out the right settings for their cluster based on their workload.  I 
finally managed to write some thoughts down along with a high level breakdown 
of how the internals function that should help people pick better settings for 
their cluster.

This post focuses on a mixed 50:50 read:write workload, but the same 
conclusions are drawn from a read heavy workload.  Hopefully this helps some 
folks get better performance / save some money on hardware!

http://thelastpickle.com/blog/2018/08/08/compression_performance.html


--
Jon Haddad
Principal Consultant, The Last Pickle


--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Re: Compression Tuning Tutorial

2018-08-09 Thread Kyrylo Lebediev
Thank you Jon, great article as usual!


One topic that was discussed in the article is filesystem cache which is 
traditionally leveraged for data caching in Cassandra (with row-caching 
disabled by default).

IIRC mmap() is used.

Some RDBMSs and NoSQL DBs use direct I/O + async I/O and maintain their own, 
not kernel-managed, DB cache, thus improving overall performance.

As Cassandra is designed to be a DB with low response time, this approach with 
DIO/AIO/DB Cache seems to be a really useful feature.

Just out of curiosity, are there reasons why this advanced I/O stack wasn't 
implemented, other than a lack of resources to do it?


Regards,

Kyrill


From: Eric Plowe 
Sent: Wednesday, August 8, 2018 9:39:44 PM
To: user@cassandra.apache.org
Subject: Re: Compression Tuning Tutorial

Great post, Jonathan! Thank you very much.

~Eric

On Wed, Aug 8, 2018 at 2:34 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
Hey folks,

We've noticed a lot over the years that people create tables usually leaving 
the default compression parameters, and have spent a lot of time helping teams 
figure out the right settings for their cluster based on their workload.  I 
finally managed to write some thoughts down along with a high level breakdown 
of how the internals function that should help people pick better settings for 
their cluster.

This post focuses on a mixed 50:50 read:write workload, but the same 
conclusions are drawn from a read heavy workload.  Hopefully this helps some 
folks get better performance / save some money on hardware!

http://thelastpickle.com/blog/2018/08/08/compression_performance.html


--
Jon Haddad
Principal Consultant, The Last Pickle


Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-08 Thread Kyrylo Lebediev
Thank you for explaining, Alain!


Predetermining the nodes to query, then sending 'data' request to one of them 
and 'digest' request to another (for CL=QUORUM, RF=3) indeed explains more 
effective use of filesystem cache when dynamic snitching is disabled.


So, there will be a replica / replicas for each token range that will never be 
queried (2 replicas for CL=ONE, 1 replica for CL=QUORUM, for RF=3). But taking 
into account that data is evenly distributed across all nodes in the cluster, 
it looks like there shouldn't be any issues related to such load redistribution, 
except the case that you mentioned, when a node is having performance issues 
but all requests are being sent to it anyway.


Regards,

Kyrill



From: Alain RODRIGUEZ 
Sent: Wednesday, August 8, 2018 1:27:50 AM
To: user <user@cassandra.apache.org>
Subject: Re: dynamic_snitch=false, prioritisation/order or reads from replicas

Hello Kyrill,

But in case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, read request is sent 
to all replicas waiting for first 2 to reply.

My understanding is that this sentence is wrong. It is as you described for 
writes indeed: all the replicas get the information (and in all the data 
centers). It's not the case for reads. For reads, x nodes are picked and used 
(x = ONE, QUORUM, ALL, ...).

Looks like the only change for dynamic_snitch=false is that "data" request is 
sent to a determined node instead of "currently the fastest one".

Indeed, the problem is that the 'currently the fastest one' changes very often 
in certain cases, thus removing the efficiency from the cache without enough 
compensation in many cases.
The idea of not using the 'bad' nodes is interesting to have more predictable 
latencies when a node is slow for some reason. Yet one of the side effects of 
this (and of the scoring that does not seem to be absolutely reliable) is that 
the clients are often routed to distinct nodes when under pressure, due to GC 
pauses for example or any other pressure.
Saving disk reads in read-heavy workloads under pressure is more important than 
trying to save a few milliseconds picking the 'best' node I guess.
I can imagine that alleviating these disks, reducing the number of disk 
IO/throughput ends up lowering the latency for all the nodes, thus the client 
application latency improves overall. That is my understanding of why it is so 
often good to disable the dynamic_snitch.

Did you get improved response for CL=ONE only or for higher CL's as well?

I must admit I don't remember for sure, but many people are using 
'LOCAL_QUORUM' and I think I saw this for this consistency level as well. Plus 
this question might no longer stand as reads in Cassandra work slightly 
differently than what you thought.

I am not 100% comfortable with this 'dynamic_snitch theory' topic, so I hope 
someone else can correct me if I am wrong, confirm or add information :). But 
for sure I have seen this disabled giving some really nice improvement (as many 
others here as you mentioned). Sometimes it was not helpful, but I have never 
seen this change being really harmful though.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-06 22:27 GMT+01:00 Kyrylo Lebediev <kyrylo_lebed...@epam.com.invalid>:

Thank you for replying, Alain!


Better use of cache for 'pinned' requests explains good the case when CL=ONE.


But in case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, read request is sent 
to all replicas waiting for first 2 to reply.

When dynamic snitching is turned on, "data" request is sent to "the fastest 
replica", and "digest" requests - to the rest of replicas.

But anyway digest is the same read operation [from SSTables through filesystem 
cache] + calculating and sending hash to coordinator. Looks like the only 
change for dynamic_snitch=false is that "data" request is sent to a determined 
node instead of "currently the fastest one".

So, if there are no mistakes in above description, improvement shouldn't be 
much visible for CL=*QUORUM...


Did you get improved response for CL=ONE only or for higher CL's as well?


Indeed an interesting thread in Jira.


Thanks,

Kyrill


From: Alain RODRIGUEZ <arodr...@gmail.com>
Sent: Monday, August 6, 2018 8:26:43 PM
To: user <user@cassandra.apache.org>
Subject: Re: dynamic_snitch=false, prioritisation/order or reads from replicas

Hello,

There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

I confirm that I have seen this improvement on clusters under pressure.

What effects stand behind this improvement?

My understanding is that this is due to the fact that the clients a

Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-06 Thread Kyrylo Lebediev
Thank you for replying, Alain!


Better use of cache for 'pinned' requests explains good the case when CL=ONE.


But in case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, read request is sent 
to all replicas waiting for first 2 to reply.

When dynamic snitching is turned on, "data" request is sent to "the fastest 
replica", and "digest" requests - to the rest of replicas.

But anyway digest is the same read operation [from SSTables through filesystem 
cache] + calculating and sending hash to coordinator. Looks like the only 
change for dynamic_snitch=false is that "data" request is sent to a determined 
node instead of "currently the fastest one".

So, if there are no mistakes in above description, improvement shouldn't be 
much visible for CL=*QUORUM...


Did you get improved response for CL=ONE only or for higher CL's as well?


Indeed an interesting thread in Jira.


Thanks,

Kyrill


From: Alain RODRIGUEZ 
Sent: Monday, August 6, 2018 8:26:43 PM
To: user <user@cassandra.apache.org>
Subject: Re: dynamic_snitch=false, prioritisation/order or reads from replicas

Hello,

There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

I confirm that I have seen this improvement on clusters under pressure.

What effects stand behind this improvement?

My understanding is that this is due to the fact that the clients are then 
'pinned', more sticking to specific nodes when the dynamic snitching is off. I 
guess there is a better use of caches and in-memory structures, reducing the 
amount of disk read needed, which can lead to way more performances than 
switching from node to node as soon as the score of some node is not good 
enough.
I am also not sure that the score calculation is always relevant, thus 
increasing the threshold before switching reads to another node is still often 
worse than disabling it completely. I am not sure if the score calculation was 
fixed, but in most cases, I think it's safer to run with 'dynamic_snitch: 
false'. Anyway, it's possible to test it on a canary node (or entire rack) and 
look at the p99 for read latencies for example :).
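
The change itself is a single line in cassandra.yaml on the canary node (a restart is needed 
for it to take effect):

  dynamic_snitch: false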

This ticket is old, but was precisely on that topic: 
https://issues.apache.org/jira/browse/CASSANDRA-6908

C*heers
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-04 15:37 GMT+02:00 Kyrylo Lebediev <kyrylo_lebed...@epam.com.invalid>:

Hello!


In case when dynamic snitching is enabled data is read from 'the fastest 
replica' and other replicas send digests for CL=QUORUM/LOCAL_QUORUM .

When dynamic snitching is disabled, as the concept of the fastest replica 
disappears, which rules are used to choose from which replica to read actual 
data (not digests):

 1) when all replicas are online

 2) when the node primarily responsible for the token range is offline.


There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

What effects stand behind this improvement?


Regards,

Kyrill



Re: Hinted Handoff

2018-08-06 Thread Kyrylo Lebediev
A small gc_grace_seconds value lowers the maximum allowed node downtime, which is 15 
minutes in your case. After 15 minutes of downtime you'll need to replace the 
node, as you described. This interval looks too short to be able to do planned 
maintenance. So, if you set a larger value for gc_grace_seconds (let's say, 
hours or a day), will you get visible read amplification / waste a lot of disk 
space / issues with compactions?


Hinted handoff may be the reason if the hinted handoff window is longer than 
gc_grace_seconds. To me it looks like the hinted handoff window 
(max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must always be 
set to a value less than gc_grace_seconds.
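
For example, with your 15-minute gc_grace_seconds, something like this in cassandra.yaml 
would keep hints from outliving the tombstone grace period (the exact value is illustrative):

  # default is 10800000 (3 hours); keep it below gc_grace_seconds (900 s = 900000 ms here)
  max_hint_window_in_ms: 600000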


Regards,

Kyrill


From: Agrawal, Pratik 
Sent: Monday, August 6, 2018 8:22:27 PM
To: user@cassandra.apache.org
Subject: Hinted Handoff


Hello all,

We use Cassandra in a non-conventional way: our data is short-lived (a life 
cycle of about 20-30 minutes), each record is updated ~5 times and then 
deleted. We have a GC grace of 15 minutes.

We are seeing 2 problems

1.) A certain number of Cassandra nodes go down and then we remove them from 
the cluster using the Cassandra removenode command and replace the dead nodes with 
new nodes. While the new nodes are joining, we see more nodes reported down (which are 
not actually down) and the following errors in the log:

“Gossip not settled after 321 polls. Gossip Stage active/pending/completed: 
1/816/0”



To fix the issue, I restarted the server and the nodes now appear to be up and 
the problem is solved



Can this problem be related to 
https://issues.apache.org/jira/browse/CASSANDRA-6590 ?



2.) Meanwhile, after restarting the nodes mentioned above, we see that some old 
deleted data is resurrected (because of the short life cycle of our data). My guess 
at the moment is that this data is resurrected due to hinted handoff. An 
interesting point to note here is that the data keeps resurrecting at periodic 
intervals (like an hour) and then finally stops. Could this be caused by hinted 
handoff? If so, is there any setting we can set to specify 
"invalidate hinted handoff data after 5-10 minutes"?



Thanks,
Pratik


Re: How to downloading Cassandra 3.11.0 and 3.11.2 binaries for ubuntu

2018-08-04 Thread Kyrylo Lebediev
Go to http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/, download the deb 
packages for the versions you need, and install them with dpkg. For example:
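
  # package file names are illustrative; check the pool listing for the exact ones
  wget http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/cassandra_3.11.2_all.deb
  sudo dpkg -i cassandra_3.11.2_all.deb
  sudo apt-get install -f    # pull in any missing dependencies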

Regards,
Kyrill

From: R1 J1 
Sent: Saturday, August 4, 2018 4:56:56 PM
To: user@cassandra.apache.org
Subject: How to downloading Cassandra 3.11.0 and 3.11.2 binaries for ubuntu

What are the steps to download the  Cassandra 3.11.0 and 3.11.2 binaries for 
ubuntu ?

If we follow the steps below, we get the 3.11.3 binaries, but we need slightly older ones:

http://cassandra.apache.org/download/

echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a 
/etc/apt/sources.list.d/cassandra.sources.list



curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -



sudo apt-get update



sudo apt-get install cassandra


Regards
R1J1


dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-04 Thread Kyrylo Lebediev
Hello!


In case when dynamic snitching is enabled data is read from 'the fastest 
replica' and other replicas send digests for CL=QUORUM/LOCAL_QUORUM .

When dynamic snitching is disabled, as the concept of the fastest replica 
disappears, which rules are used to choose from which replica to read actual 
data (not digests):

 1) when all replicas are online

 2) when the node primarily responsible for the token range is offline.


There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

What effects stand behind this improvement?


Regards,

Kyrill


Re: Performance impact of using NetworkTopology with 3 node cassandra cluster in One DC

2018-08-02 Thread Kyrylo Lebediev
There are two factors in terms of Cassandra that determine what's called 
network topology: datacenter and rack.

rack - it's not necessarily a physical rack, it's rather a single point of 
failure. For example, in case of AWS one availability zone is usually chosen to 
be a Cassandra rack.

datacenter - is a set of racks between which we have good network connection 
and low latency. Usually for AWS it's a region.


If you use NetworkTopologyStrategy + properly configured snitch, network 
topology is taken into account during replica placement: there won't be more 
than 1 replica of a data chunk in a rack. This means that if a whole rack fails 
(for example AWS AZ goes offline), there are still 2 other replicas online for 
each chunk of data (in case RF=3) and queries with CL=QUORUM are still working.
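
For a single-DC cluster the keyspace definition is simply something like this (keyspace and 
DC names are illustrative; the DC name must match what your snitch reports, e.g. in 
'nodetool status'):

  CREATE KEYSPACE my_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};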

In order to avoid data imbalance between all the nodes (which may cause "hot 
spots" in your cluster = performance impact), all racks should have the same 
number of nodes with approximately the same capacity.


Also, sometimes CL=QUORUM isn't used correctly and CL=LOCAL_QUORUM should be 
used instead. There is no difference between the two in the case of one DC, but 
in the case of two or more DCs the former leads to cross-DC communication, as a 
majority of all replicas across all DCs has to be queried. This obviously 
leads to increased latencies. The same is true, for example, for ONE vs 
LOCAL_ONE.


If you take a look at the manual on how to add a DC to a cluster, you'll find 
cautions about QUORUM/LOCAL_QUORUM during the operation. The reason is that 
when data which is supposed to be in the new DC isn't there yet (as 
streaming is in progress and hasn't completed), it will cause blocking read 
repairs.


Regards,

Kyrill



From: Murtaza Talwari 
Sent: Thursday, August 2, 2018 1:22:16 PM
To: user@cassandra.apache.org
Subject: Performance impact of using NetworkTopology with 3 node cassandra 
cluster in One DC


We are using 3 node Cassandra cluster in one data center.



For our keyspaces as suggested in best practices we are using NetworkTopology 
for replication strategy using the GossipingPropertyFileSnitch.

For Read/Write consistency we are using as QUORUM.



In majority of cases when users use NetworkTopology as replication strategy 
they might have multiple DataCenters configured.

In our case we have only one DataCenter,



  *   With that, will using NetworkTopology as the replication strategy cause any 
performance impact?
  *   As we are using QUORUM as the read/write consistency, which considers 
multiple data centers, does QUORUM consistency have any performance impact? Is 
it OK to continue using QUORUM consistency considering future expansion to more 
data centers?



Please suggest.



Regards,



Re: What will happen after adding another data disk

2018-06-12 Thread Kyrylo Lebediev
Also, it's worth noting that usage of JBOD isn't recommended for older Cassandra 
versions, as there are known issues with data imbalance on JBOD.

IIRC, the JBOD data imbalance was fixed in some 3.x version (3.2?).

For older versions, creating one large filesystem on top of an md or lvm device seems 
to be a better choice.


From: Eunsu Kim 
Sent: Tuesday, June 12, 2018 9:06:07 AM
To: user@cassandra.apache.org
Subject: Re: What will happen after adding another data disk

In my experience, adding a new disk and restarting the Cassandra process slowly 
distributes the disk usage evenly, so that existing disks have less disk usage

On 12 Jun 2018, at 11:09 AM, wxn...@zjqunshuo.com 
wrote:

Hi,
I know Cassandra can make use of multiple disks. My data disk is almost full 
and I want to add another 2TB disk. I don't know what will happen after the 
addition.
1. Will C* write to both disks until the old disk is full?
2. And what will happen after the old one is full? Will C* stop writing to the 
old one and only write to the new one with free space?

Thanks!



Re: compaction: huge number of random reads

2018-05-08 Thread Kyrylo Lebediev
You are right, Kurt, it's what I was trying to do - lowering compression chunk 
size and device read-ahead.

Column-family settings: "compression = {'chunk_length_kb': '16', 
'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor'}"
Device read-ahead: blockdev --setra 8 

I had to fall back to the default RA of 256 and got large merged reads and a small number 
of IOPS with good MBytes/sec after this.
I believe it's not caused by C* settings, but by something with filesystem / 
IO-related kernel settings (or it's by design?).


Tried to emulate C* reads during compactions by dd:


**  RA=8 (4k)

# blockdev --setra 8 /dev/xvdb
# dd if=/dev/zero of=/data/ZZZ
^C16980952+0 records in
16980951+0 records out
8694246912 bytes (8.7 GB, 8.1 GiB) copied, 36.4651 s, 238 MB/s
# sync

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/data/ZZZ of=/dev/null
^C846513+0 records in
846512+0 records out
433414144 bytes (433 MB, 413 MiB) copied, 21.4604 s, 20.2 MB/s   <<<<<

High IOPS in this case, io size = 4k.
What's interesting, setting bs=128k in dd didn't decrease iops, io size still 
was 4k


** RA=256 (128k):
# blockdev --setra 256 /dev/xvdb
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/data/ZZZ of=/dev/null
^C15123937+0 records in
15123936+0 records out
7743455232 bytes (7.7 GB, 7.2 GiB) copied, 60.8407 s, 127 MB/s  <<<<<<

io size - 128k, small iops, good throughput (limited by EBS bandwidth)

Writes were fine in both cases: io size 128k, good throughput limited by EBS 
bandwidth only

Is the above situation typical for a small read-ahead (the "price for small fast reads") 
or is something wrong with my setup?
[It's not an XFS mailing list, but as somebody here may know this:] Why, in the case 
of a small RA, are even large reads (bs=128k) converted to multiple small reads?

Regards,
Kyrill



From: kurt greaves <k...@instaclustr.com>
Sent: Tuesday, May 8, 2018 2:12:40 AM
To: User
Subject: Re: compaction: huge number of random reads

If you've got small partitions/small reads you should test lowering your 
compression chunk size on the table and disabling read ahead. This sounds like 
it might just be a case of read amplification.
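
On 2.1, applying that suggestion looks something like this (table name is illustrative; it 
only affects newly written SSTables, existing ones keep the old chunk size until rewritten, 
e.g. by 'nodetool upgradesstables -a' or normal compaction):

  ALTER TABLE my_ks.my_table
    WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor',
                        'chunk_length_kb': '16'};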

On Tue., 8 May 2018, 05:43 Kyrylo Lebediev, <kyrylo_lebed...@epam.com> wrote:

Dear Experts,


I'm observing strange behavior on a cluster 2.1.20 during compactions.


My setup is:

12 nodes  m4.2xlarge (8 vCPU, 32G RAM) Ubuntu 16.04, 2T EBS gp2.

Filesystem: XFS, blocksize 4k, device read-ahead - 4k

/sys/block/vxdb/queue/nomerges = 0

SizeTieredCompactionStrategy


After data loads when effectively nothing else is talking to the cluster and 
compactions is the only activity, I see something like this:
$ iostat -dkx 1
...


Device:         rrqm/s   wrqm/s      r/s     w/s      rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00     0.00    0.00       0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00  4769.00  213.00   19076.00   26820.00    18.42     7.95    1.17    1.06    3.76   0.20 100.00

Device:         rrqm/s   wrqm/s      r/s     w/s      rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00     0.00    0.00       0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00  6098.00  177.00   24392.00   22076.00    14.81     6.46    1.36    0.96   15.16   0.16 100.00

Writes are fine: 177 writes/sec <-> ~22Mbytes/sec,

But for some reason compactions generate a huge number of small reads:
6098 reads/s <-> ~24Mbytes/sec.  ===>   Read size is 4k


Why am I getting a huge number of 4k reads instead of a much smaller number of large reads?

What could be the reason?


Thanks,

Kyrill




compaction: huge number of random reads

2018-05-07 Thread Kyrylo Lebediev
Dear Experts,


I'm observing strange behavior on a cluster 2.1.20 during compactions.


My setup is:

12 nodes  m4.2xlarge (8 vCPU, 32G RAM) Ubuntu 16.04, 2T EBS gp2.

Filesystem: XFS, blocksize 4k, device read-ahead - 4k

/sys/block/vxdb/queue/nomerges = 0

SizeTieredCompactionStrategy


After data loads when effectively nothing else is talking to the cluster and 
compactions is the only activity, I see something like this:
$ iostat -dkx 1
...


Device:         rrqm/s   wrqm/s      r/s     w/s      rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00     0.00    0.00       0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00  4769.00  213.00   19076.00   26820.00    18.42     7.95    1.17    1.06    3.76   0.20 100.00

Device:         rrqm/s   wrqm/s      r/s     w/s      rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00     0.00    0.00       0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00  6098.00  177.00   24392.00   22076.00    14.81     6.46    1.36    0.96   15.16   0.16 100.00

Writes are fine: 177 writes/sec <-> ~22Mbytes/sec,

But for some reason compactions generate a huge number of small reads:
6098 reads/s <-> ~24Mbytes/sec.  ===>   Read size is 4k


Why am I getting a huge number of 4k reads instead of a much smaller number of large reads?

What could be the reason?


Thanks,

Kyrill




Re: 3.11.2 io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)

2018-05-07 Thread Kyrylo Lebediev
Hi!

You are talking about  messages like below, right?


INFO  [epollEventLoopGroup-2-8] 2018-05-05 16:57:45,537 Message.java:623 - 
Unexpected exception during request; channel = [id: 0xa4879fdd, 
L:/10.175.20.112:9042 - R:/10.175.20.73:2508]

io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer

at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

As far as I understand, these messages mean that some C* client (10.175.20.73 
in your case) closed connection abnormally and the issue is on client side. So, 
this should be investigated on client side. I'd say if your app works fine it's 
up to you to investigate the issue or not ("due diligence").


At the same time there are some messages in the log related to hinted-handoff, 
which could mean your cluster suffers from write timeouts (unless these 
messages are caused by temporary unavailability of  10.175.20.114 because of 
node reboot etc). This is definitely worth investigating. It can be caused by: 
non-optimal GC settings (btw, messages like "ParNew GC in 309ms" may imply this; 
a delay of 309 ms is too long), hardware performance issues, networking issues, etc.


Regards,

Kyrill



From: Xiangfei Ni 
Sent: Monday, May 7, 2018 5:44:21 AM
To: user@cassandra.apache.org
Subject: 3.11.2 io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source)


Hi Expert,

  I have this error in the Cassandra logs,the nodes are ok and behave as normal,

INFO  [epollEventLoopGroup-2-8] 2018-05-05 16:57:45,537 Message.java:623 - 
Unexpected exception during request; channel = [id: 0xa4879fdd, 
L:/10.175.20.112:9042 - R:/10.175.20.73:2508]

io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer

at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

INFO  [epollEventLoopGroup-2-9] 2018-05-05 16:57:45,537 Message.java:623 - 
Unexpected exception during request; channel = [id: 0x86e1a3d9, 
L:/10.175.20.112:9042 - R:/10.175.20.37:44006]

io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer

at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

INFO  [HintsDispatcher:43] 2018-05-05 16:57:49,991 HintsStore.java:126 - 
Deleted hint file 5696cbef-edd1-42a6-b4f1-3c2f0881b8cb-1525510648347-1.hints

INFO  [HintsDispatcher:43] 2018-05-05 16:57:49,991 
HintsDispatchExecutor.java:282 - Finished hinted handoff of file 
5696cbef-edd1-42a6-b4f1-3c2f0881b8cb-1525510648347-1.hints to endpoint 
/10.175.20.114: 5696cbef-edd1-42a6-b4f1-3c2f0881b8cb

INFO  [Service Thread] 2018-05-05 16:58:01,159 GCInspector.java:284 - ParNew GC 
in 309ms.  CMS Old Gen: 1258291920 -> 1258482424; Par Eden Space: 503316480 -> 
0; Par Survivor Space: 700080 -> 782256

WARN  [GossipTasks:1] 2018-05-05 16:58:23,273 Gossiper.java:791 - Gossip stage 
has 2 pending tasks; skipping status check (no nodes will be marked down)

INFO  [HintsDispatcher:44] 2018-05-05 16:58:33,913 HintsStore.java:126 - 
Deleted hint file 21033fe2-dca9-42a5-8e68-496dd3517f53-1525510690633-1.hints

INFO  [HintsDispatcher:44] 2018-05-05 16:58:33,913 
HintsDispatchExecutor.java:282 - Finished hinted handoff of file 
21033fe2-dca9-42a5-8e68-496dd3517f53-1525510690633-1.hints to endpoint 
/10.175.20.113: 21033fe2-dca9-42a5-8e68-496dd3517f53

INFO  [epollEventLoopGroup-2-10] 2018-05-05 16:58:41,750 Message.java:623 - 
Unexpected exception during request; channel = [id: 0xc640c7a2, 
L:/10.175.20.112:9042 - R:/10.175.20.37:56784]

io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer

at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

the limit setting in CentOS 6.8 is setting as below :

* hard memlock unlimited

* hard nofile 10

* soft nofile 10

* soft memlock unlimited

* hard nproc 32768

* hard as unlimited

* soft nproc 32768

* soft as unlimited

root soft nofile 32768

root hard nofile 32768



  can I ignore this error or do you have any suggestions about this error?



Best Regards,



倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516




Re: Host not available

2018-04-30 Thread Kyrylo Lebediev
First of all you need to identify what's the bottleneck in your case.

First things to check:

1) JVM: the heap may be too small for such workloads.

Enable GC logging in /etc/cassandra/cassandra-env.sh, then analyze its output 
during workload. In case you observe messages about long pauses  or Full GC's, 
most probably it's the root-cause.
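
A sketch of what to enable, assuming Java 8 style GC logging flags (on 3.x most of these 
also exist commented out in jvm.options):

  # in cassandra-env.sh (or jvm.options on 3.x)
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"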

2) hardware - monitor cpu, disks when writes are active.


there is a really good command 'nodetool tpstats' which helps identify 
bottleneck / understand what's going on.


Two articles that may help (credits to Chris and Jon):

https://blog.pythian.com/guide-to-cassandra-thread-pools/

http://thelastpickle.com/blog/2018/04/11/gc-tuning.html


Regards,

Kyrill


From: Soheil Pourbafrani 
Sent: Monday, April 30, 2018 10:40:10 AM
To: user@cassandra.apache.org
Subject: Host not available

I have a 3-node Cassandra 3.11 cluster: one node with 4 GB memory, and the others with 
2 GB. All of them have 4 CPU cores. I want to insert data into a table and read it for 
visualization at the same time. When I insert data using the Flink Cassandra 
Connector at a rate of 200 inserts per second, the reader application can't connect to 
the hosts, but when the insertion process is not running, the reader 
application can connect and fetch data.

All cassandra.yaml properties are defaults.
Two nodes are seed nodes and keyspace replication factor is 2.

How can I optimize the cluster and solve the problem?

In addition, I have the following log:

 INFO  [epollEventLoopGroup-2-3] 2018-04-30 12:05:11,225 Message.java:623 - 
Unexpected exception during request; channel = [id: 0xb84fd5cc, 
L:/192.168.1.218:9042 - R:/1$
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer



Re: copy from one table to another

2018-04-24 Thread Kyrylo Lebediev
Thank you,  Rahul!

From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Saturday, April 21, 2018 3:02:11 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

That’s correct.

On Apr 21, 2018, 5:05 AM -0400, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:

You mean that correct table UUID should be specified as suffix in directory 
name?
For example:


Table:


cqlsh> select id from system_schema.tables where keyspace_name='test' and 
table_name='usr';

 id
--
 ea2f6da0-f931-11e7-8224-43ca70555242


Directory name:
./data/test/usr-ea2f6da0f93111e7822443ca70555242


Correct?


Regards,

Kyrill


From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Thursday, April 19, 2018 10:53:11 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

Each table has a different Guid — doing a hard link may work as long as the 
sstable dir's guid is the same as the newly created table in the system schema.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 19, 2018, 10:41 AM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:

The table is too large to be copied fast/effectively, so I'd like to leverage 
the immutability property of SSTables.

My idea is to:

1) create new empty table (NewTable) with the same structure as existing one 
(OldTable)
2) at some time run simultaneous 'nodetool snapshot -t ttt  OldTable' 
on all nodes -- this will create point in time state of OldTable

3) on each node run:
   for each file in OldTable ttt snapshot directory:

 ln 
//OldTable-/snapshots/ttt/_OldTable_xx 
.//Newtable/_NewTable_x

 then:
 nodetool refresh  NewTable

4) nodetool repair NewTable
5) Use OldTable and NewTable independently (Read/Write)
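
A rough sketch of steps 2-3 on one node, following the test/usr naming from the earlier 
message (paths, data directory and the new table's name are illustrative):

  nodetool snapshot -t ttt test
  SNAP=/var/lib/cassandra/data/test/usr-ea2f6da0f93111e7822443ca70555242/snapshots/ttt
  NEWDIR=$(ls -d /var/lib/cassandra/data/test/usr_new-*)   # data directory of the new, empty table
  find "$SNAP" -maxdepth 1 -type f ! -name 'manifest.json' ! -name 'schema.cql' -exec ln {} "$NEWDIR"/ \;
  nodetool refresh test usr_new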


Are there any issues with using hardlinks (ln) instead of copying (cp) in this 
case?


Thanks,

Kyrill



From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Wednesday, April 18, 2018 2:07:17 AM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

1. Make a new table with the same schema.
For each node
2. Shutdown node
3. Copy data from Source sstable dir to new sstable dir.

This will do what you want.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 16, 2018, 4:21 PM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:
Thanks,  Ali.
I just need to copy a large table in production without actual copying by using 
hardlinks. After this both tables should be used independently (RW). Is this a 
supported way or not?

Regards,
Kyrill

From: Ali Hubail <ali.hub...@petrolink.com>
Sent: Monday, April 16, 2018 6:51:51 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

If you want to copy a portion of the data to another table, you can also use 
sstable cql writer. It is more of an advanced feature and can be tricky, but 
doable.
once you write the new sstables, you can then use the sstableloader to stream 
the new data into the new table.
check this out:
https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

I have recently used this to clean up 500 GB worth of sstable data in order to 
purge tombstones that were mistakenly generated by the client.
obviously this is not as fast as hardlinks + refresh, but it's much faster and 
more efficient than using cql to copy data accross the tables.
take advantage of CQLSSTableWriter.builder.sorted() if you can, and utilize 
writetime if you have to.

Ali Hubail

Confidentiality warning: This message and any attachments are intended only for 
the persons to whom this message is addressed, are confidential, and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, retransmission, conversion to hard copy, copying, modification, 
circulation or other use of this message and any attachments is strictly 
prohibited. If you receive this message in error, please notify the sender 
immediately by return email, and delete this message and any attachments from 
your system. Petrolink International Limited its subsidiaries, holding 
companies and affiliates disclaims all responsibility from and accepts no 
liability whatsoever for the consequences of any unauthorized person acting, or 
refraining from acting, on any information contained in this message. For 
security purposes, staff training, to assist in resolving complaints and to 
improve our customer service, email communications may be monitored and 
telephone calls may be recorded.


Kyrylo Lebediev <kyrylo_lebed...@epam.com>

04/16/2018 10:37 AM

Please respond to
user@cassandra.apache.org




To
"user@cassandra.apache.org" <user@cassandra.apache.org>,
cc

Subject
Re: copy from one table to another







Any 

Re: copy from one table to another

2018-04-21 Thread Kyrylo Lebediev
You mean that the correct table UUID should be specified as the suffix in the directory 
name?
For example:


Table:


cqlsh> select id from system_schema.tables where keyspace_name='test' and 
table_name='usr';

 id
--
 ea2f6da0-f931-11e7-8224-43ca70555242


Directory name:
./data/test/usr-ea2f6da0f93111e7822443ca70555242


Correct?


Regards,

Kyrill


From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Thursday, April 19, 2018 10:53:11 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

Each table has a different Guid — doing a hard link may work as long as the 
sstable dir’s guid is the same as the newly created table in the system schema.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 19, 2018, 10:41 AM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:

The table is too large to be copied quickly/effectively, so I'd like to leverage the immutability of SSTables.

My idea is to:

1) create new empty table (NewTable) with the same structure as existing one 
(OldTable)
2) at some time run simultaneous 'nodetool snapshot -t ttt  OldTable' 
on all nodes -- this will create point in time state of OldTable

3) on each node run:
   for each file in OldTable ttt snapshot directory:

 ln 
//OldTable-/snapshots/ttt/_OldTable_xx 
.//Newtable/_NewTable_x

 then:
 nodetool refresh  NewTable

4) nodetool repair NewTable
5) Use OldTable and NewTable independently (Read/Write)


Are there any issues with using hardlinks (ln) instead of copying (cp) in this 
case?


Thanks,

Kyrill



From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Wednesday, April 18, 2018 2:07:17 AM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

1. Make a new table with the same schema.
For each node
2. Shutdown node
3. Copy data from Source sstable dir to new sstable dir.

This will do what you want.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 16, 2018, 4:21 PM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:
Thanks,  Ali.
I just need to copy a large table in production without actual copying by using 
hardlinks. After this both tables should be used independently (RW). Is this a 
supported way or not?

Regards,
Kyrill

From: Ali Hubail <ali.hub...@petrolink.com>
Sent: Monday, April 16, 2018 6:51:51 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

If you want to copy a portion of the data to another table, you can also use 
sstable cql writer. It is more of an advanced feature and can be tricky, but 
doable.
once you write the new sstables, you can then use the sstableloader to stream 
the new data into the new table.
check this out:
https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

I have recently used this to clean up 500 GB worth of sstable data in order to 
purge tombstones that were mistakenly generated by the client.
obviously this is not as fast as hardlinks + refresh, but it's much faster and 
more efficient than using cql to copy data across the tables.
take advantage of CQLSSTableWriter.builder.sorted() if you can, and utilize 
writetime if you have to.

Ali Hubail

Confidentiality warning: This message and any attachments are intended only for 
the persons to whom this message is addressed, are confidential, and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, retransmission, conversion to hard copy, copying, modification, 
circulation or other use of this message and any attachments is strictly 
prohibited. If you receive this message in error, please notify the sender 
immediately by return email, and delete this message and any attachments from 
your system. Petrolink International Limited its subsidiaries, holding 
companies and affiliates disclaims all responsibility from and accepts no 
liability whatsoever for the consequences of any unauthorized person acting, or 
refraining from acting, on any information contained in this message. For 
security purposes, staff training, to assist in resolving complaints and to 
improve our customer service, email communications may be monitored and 
telephone calls may be recorded.


Kyrylo Lebediev <kyrylo_lebed...@epam.com>

04/16/2018 10:37 AM

Please respond to
user@cassandra.apache.org




To
"user@cassandra.apache.org" <user@cassandra.apache.org>,
cc

Subject
Re: copy from one table to another







Any issues if we:

1) create a new empty table with the same structure as the old one
2) create hardlinks ("ln without -s"): 
.../-/--* ---> 
.../-/--*
3) run nodetool refresh -- newkeyspacename newtable

and then query/modify both tables independently/simultaneously?

In theory, as SSTables are immutable, this s

Re: copy from one table to another

2018-04-19 Thread Kyrylo Lebediev
The table is too large to be copied quickly/effectively, so I'd like to leverage the immutability of SSTables.

My idea is to:

1) create new empty table (NewTable) with the same structure as existing one 
(OldTable)
2) at some time run simultaneous 'nodetool snapshot -t ttt  OldTable' 
on all nodes -- this will create point in time state of OldTable

3) on each node run:
   for each file in OldTable ttt snapshot directory:

 ln 
//OldTable-/snapshots/ttt/_OldTable_xx 
.//Newtable/_NewTable_x

 then:
 nodetool refresh  NewTable

4) nodetool repair NewTable
5) Use OldTable and NewTable independently (Read/Write)
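
A rough sketch of steps 2-3 on a single node (directory names are examples and nodetool flags differ slightly between versions):

KS=mykeyspace
TAG=ttt
DATA=/var/lib/cassandra/data

# 2) point-in-time snapshot of the old table
nodetool snapshot -t "$TAG" -cf OldTable "$KS"

OLD_DIR=$(ls -d "$DATA/$KS"/oldtable-*)
NEW_DIR=$(ls -d "$DATA/$KS"/newtable-*)

# 3) hardlink every snapshot file into the new table's directory,
#    swapping the table-name part of the file name where present
for f in "$OLD_DIR/snapshots/$TAG"/*; do
  ln "$f" "$NEW_DIR/$(basename "$f" | sed 's/OldTable/NewTable/')"
done

nodetool refresh "$KS" NewTable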


Are there any issues with using hardlinks (ln) instead of copying (cp) in this 
case?


Thanks,

Kyrill



From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Wednesday, April 18, 2018 2:07:17 AM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

1. Make a new table with the same schema.
For each node
2. Shutdown node
3. Copy data from Source sstable dir to new sstable dir.

This will do what you want.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 16, 2018, 4:21 PM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:
Thanks,  Ali.
I just need to copy a large table in production without actual copying by using 
hardlinks. After this both tables should be used independently (RW). Is this a 
supported way or not?

Regards,
Kyrill

From: Ali Hubail <ali.hub...@petrolink.com>
Sent: Monday, April 16, 2018 6:51:51 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

If you want to copy a portion of the data to another table, you can also use 
sstable cql writer. It is more of an advanced feature and can be tricky, but 
doable.
once you write the new sstables, you can then use the sstableloader to stream 
the new data into the new table.
check this out:
https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

I have recently used this to clean up 500 GB worth of sstable data in order to 
purge tombstones that were mistakenly generated by the client.
obviously this is not as fast as hardlinks + refresh, but it's much faster and 
more efficient than using cql to copy data across the tables.
take advantage of CQLSSTableWriter.builder.sorted() if you can, and utilize 
writetime if you have to.

Ali Hubail

Confidentiality warning: This message and any attachments are intended only for 
the persons to whom this message is addressed, are confidential, and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, retransmission, conversion to hard copy, copying, modification, 
circulation or other use of this message and any attachments is strictly 
prohibited. If you receive this message in error, please notify the sender 
immediately by return email, and delete this message and any attachments from 
your system. Petrolink International Limited its subsidiaries, holding 
companies and affiliates disclaims all responsibility from and accepts no 
liability whatsoever for the consequences of any unauthorized person acting, or 
refraining from acting, on any information contained in this message. For 
security purposes, staff training, to assist in resolving complaints and to 
improve our customer service, email communications may be monitored and 
telephone calls may be recorded.


Kyrylo Lebediev <kyrylo_lebed...@epam.com>

04/16/2018 10:37 AM

Please respond to
user@cassandra.apache.org




To
"user@cassandra.apache.org" <user@cassandra.apache.org>,
cc

Subject
Re: copy from one table to another







Any issues if we:

1) create a new empty table with the same structure as the old one
2) create hardlinks ("ln without -s"): 
.../-/--* ---> 
.../-/--*
3) run nodetool refresh -- newkeyspacename newtable

and then query/modify both tables independently/simultaneously?

In theory, as SSTables are immutable, this should work, but could there be some 
hidden issues?

Regards,
Kyrill



From: Dmitry Saprykin <saprykin.dmi...@gmail.com>
Sent: Sunday, April 8, 2018 7:33:03 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

You can copy hardlinks to ALL SSTables from old to new table and then delete 
part of data you do not need in a new one.

On Sun, Apr 8, 2018 at 10:20 AM, Nitan Kainth 
<nitankai...@gmail.com<mailto:nitankai...@gmail.com>> wrote:
If it is for testing and you don’t need any specific data, just copy a set of 
sstables with all files of that sequence, move them to the target table's directory 
and rename them.

Restart target node or run nodetool refresh

Sent from my iPhone

On Apr 8, 2018, at 4:15 AM, onmstester onmstester 
<onmstes...@zoho.com<mailto:onmstes...@zoho.com>> wrote:

Is there any way to copy

Re: copy from one table to another

2018-04-16 Thread Kyrylo Lebediev
Thanks,  Ali.
I just need to copy a large table in production without actual copying by using 
hardlinks. After this both tables should be used independently (RW). Is this a 
supported way or not?

Regards,
Kyrill

From: Ali Hubail <ali.hub...@petrolink.com>
Sent: Monday, April 16, 2018 6:51:51 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

If you want to copy a portion of the data to another table, you can also use 
sstable cql writer. It is more of an advanced feature and can be tricky, but 
doable.
once you write the new sstables, you can then use the sstableloader to stream 
the new data into the new table.
check this out:
https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

I have recently used this to clean up 500 GB worth of sstable data in order to 
purge tombstones that were mistakenly generated by the client.
obviously this is not as fast as hardlinks + refresh, but it's much faster and 
more efficient than using cql to copy data across the tables.
take advantage of CQLSSTableWriter.builder.sorted() if you can, and utilize 
writetime if you have to.

Ali Hubail

Confidentiality warning: This message and any attachments are intended only for 
the persons to whom this message is addressed, are confidential, and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, retransmission, conversion to hard copy, copying, modification, 
circulation or other use of this message and any attachments is strictly 
prohibited. If you receive this message in error, please notify the sender 
immediately by return email, and delete this message and any attachments from 
your system. Petrolink International Limited its subsidiaries, holding 
companies and affiliates disclaims all responsibility from and accepts no 
liability whatsoever for the consequences of any unauthorized person acting, or 
refraining from acting, on any information contained in this message. For 
security purposes, staff training, to assist in resolving complaints and to 
improve our customer service, email communications may be monitored and 
telephone calls may be recorded.


Kyrylo Lebediev <kyrylo_lebed...@epam.com>

04/16/2018 10:37 AM
Please respond to
user@cassandra.apache.org




To
"user@cassandra.apache.org" <user@cassandra.apache.org>,
cc

Subject
Re: copy from one table to another







Any issues if we:

1) create a new empty table with the same structure as the old one
2) create hardlinks ("ln without -s"): 
.../-/--* ---> 
.../-/--*
3) run nodetool refresh -- newkeyspacename newtable

and then query/modify both tables independently/simultaneously?

In theory, as SSTables are immutable, this should work, but could there be some 
hidden issues?

Regards,
Kyrill



From: Dmitry Saprykin <saprykin.dmi...@gmail.com>
Sent: Sunday, April 8, 2018 7:33:03 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

You can copy hardlinks to ALL SSTables from old to new table and then delete 
part of data you do not need in a new one.

On Sun, Apr 8, 2018 at 10:20 AM, Nitan Kainth 
<nitankai...@gmail.com<mailto:nitankai...@gmail.com>> wrote:
If it is for testing and you don’t need any specific data, just copy a set of 
sstables with all files of that sequence, move them to the target table's directory 
and rename them.

Restart target node or run nodetool refresh

Sent from my iPhone

On Apr 8, 2018, at 4:15 AM, onmstester onmstester 
<onmstes...@zoho.com<mailto:onmstes...@zoho.com>> wrote:

Is there any way to copy some part of a table to another table in cassandra? A 
large amount of data should be copied so i don't want to fetch data to client 
and stream it back to cassandra using cql.

Sent using Zoho Mail<https://www.zoho.com/mail/>





Re: copy from one table to another

2018-04-16 Thread Kyrylo Lebediev
Any issues if we:


1) create a new empty table with the same structure as the old one

2) create hardlinks ("ln without -s"): 
.../-/--* ---> 
.../-/--*

3) run nodetool refresh -- newkeyspacename newtable


and then query/modify both tables independently/simultaneously?


In theory, as SSTables are immutable, this should work, but could there be some 
hidden issues?


Regards,

Kyrill


From: Dmitry Saprykin 
Sent: Sunday, April 8, 2018 7:33:03 PM
To: user@cassandra.apache.org
Subject: Re: copy from one table to another

You can copy hardlinks to ALL SSTables from old to new table and then delete 
part of data you do not need in a new one.

On Sun, Apr 8, 2018 at 10:20 AM, Nitan Kainth 
> wrote:
If it is for testing and you don’t need any specific data, just copy a set of 
sstables with all files of that sequence, move them to the target table's directory 
and rename them.

Restart target node or run nodetool refresh

Sent from my iPhone

On Apr 8, 2018, at 4:15 AM, onmstester onmstester 
> wrote:

Is there any way to copy some part of a table to another table in cassandra? A 
large amount of data should be copied so i don't want to fetch data to client 
and stream it back to cassandra using cql.


Sent using Zoho Mail





Re: Adding disk to operating C*

2018-03-09 Thread Kyrylo Lebediev
Niclas,

Here is Jeff's comment regarding this: https://stackoverflow.com/a/31690279


From: Niclas Hedhman 
Sent: Friday, March 9, 2018 9:09:53 AM
To: user@cassandra.apache.org; Rahul Singh
Subject: Re: Adding disk to operating C*

I am curious about the side comment; "Depending on your usecase you may not
want to have a data density over 1.5 TB per node."

Why is that? I am planning much bigger than that, and now you give me
pause...


Cheers
Niclas

On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh 
> wrote:
Are you putting both the commitlogs and the SSTables on the same disks? Consider 
clearing or moving your snapshots regularly if those are also taking up space. You may 
be able to save some space before you add drives.

You should be able to add these new drives and mount them without an issue. Try 
to avoid different number of data dirs across nodes. It makes automation of 
operational processes a little harder.

As an aside, Depending on your usecase you may not want to have a data density 
over 1.5 TB per node.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim 
>, wrote:
Hello,

I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)

I'm trying to mount an additional disk (SSD 1TB) on each node because the disk 
usage growth rate is higher than I expected. Then I will add the new directory 
to data_file_directories in cassandra.yaml.
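
Roughly like this (the second mount point is just an example):

data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/data2/cassandra/data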

Can I get advice from anyone who has experienced this situation?
If we go through the above steps one by one, will we be able to complete the 
upgrade without losing data?
The replication strategy is SimpleStrategy, RF 2.

Thank you in advance
-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org




--
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Kyrylo Lebediev
Not sure where I heard this, but AFAIK data imbalance when multiple 
data_directories are in use is a known issue for older versions of Cassandra. 
This might be the root-cause of your issue.

Which version of C* are you using?

Unfortunately, I don't remember in which version this imbalance issue was fixed.


-- Kyrill


From: Yasir Saleem 
Sent: Friday, March 9, 2018 1:34:08 PM
To: user@cassandra.apache.org
Subject: Re: uneven data movement in one of the disk in Cassandra

Hi Alex,

no active compaction, right now.

(screenshot of 'nodetool compactionstats' output omitted)


On Fri, Mar 9, 2018 at 3:47 PM, Oleksandr Shulgin 
> wrote:
On Fri, Mar 9, 2018 at 11:40 AM, Yasir Saleem 
> wrote:
Thanks, Nicolas Guyomar

I am new to cassandra, here is the properties which I can see in yaml file:

# of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
compaction_large_partition_warning_threshold_mb: 100

To check currently active compaction please use this command:

nodetool compactionstats -H

on the host which shows the problem.

--
Alex




Re: Amazon Time Sync Service + ntpd vs chrony

2018-03-09 Thread Kyrylo Lebediev
Thank you to all who replied so far,  thank you Ben for the links you provided!


From: Ben Slater <ben.sla...@instaclustr.com>
Sent: Friday, March 9, 2018 12:12:09 AM
To: user@cassandra.apache.org
Subject: Re: Amazon Time Sync Service + ntpd vs chrony

It is important to make sure you are using the same NTP servers across your 
cluster - we used to see relatively frequent NTP issues across our fleet using 
default/public NTP servers until (back in 2015) we implemented our own NTP pool 
(see https://www.instaclustr.com/apache-cassandra-synchronization/ which 
references some really good and detailed posts from 
logentries.com<http://logentries.com> on the potential issues).

Cheers
Ben

On Fri, 9 Mar 2018 at 02:07 Michael Shuler 
<mich...@pbandjelly.org<mailto:mich...@pbandjelly.org>> wrote:
As long as your nodes are syncing time using the same method, that
should be good. Don't mix daemons, however, since they may sync from
different sources. Whether you use ntpd, openntp, ntpsec, chrony isn't
really important, since they are all just background daemons to sync the
system clock. There is nothing Cassandra-specific.
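
For what it's worth, on EC2 a minimal chrony setup pointing at the Amazon Time
Sync Service is just a couple of lines; treat the polling options below as a
starting point only:

# /etc/chrony/chrony.conf (or /etc/chrony.conf, depending on the distro)
server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4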

--
Kind regards,
Michael

On 03/08/2018 04:15 AM, Kyrylo Lebediev wrote:
> Hi!
>
> Recently Amazon announced launch of Amazon Time Sync Service
> (https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/)
> and now it's AWS-recommended way for time sync on EC2 instances
> (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
> It's stated there that chrony is faster / more precise than ntpd.
>
> Needless to say, correct time sync configuration is very important for any
> C* setup.
>
> Does anybody have positive experience using chrony, Amazon Time Sync
> Service with Cassandra and/or a combination of them?
> Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
> Are there any chrony best-practices/custom settings for C* setups?
>
> Thanks,
> Kyrill
>


-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org<mailto:user-unsubscr...@cassandra.apache.org>
For additional commands, e-mail: 
user-h...@cassandra.apache.org<mailto:user-h...@cassandra.apache.org>

--

Ben Slater
Chief Product Officer

Read our latest technical blog posts here<https://www.instaclustr.com/blog/>.

This email has been sent on behalf of Instaclustr Pty. Limited (Australia) and 
Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally privileged 
information.  If you are not the intended recipient, do not copy or disclose 
its content, but please reply to this email immediately and highlight the error 
to the sender and then immediately delete the message.


Amazon Time Sync Service + ntpd vs chrony

2018-03-08 Thread Kyrylo Lebediev
Hi!

Recently Amazon announced launch of Amazon Time Sync Service 
(https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/) 
and now it's AWS-recommended way for time sync on EC2 instances 
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html). It's 
stated there that chrony is faster / more precise than ntpd.

Needless to say, correct time sync configuration is very important for any C* 
setup.

Does anybody have positive experience using chrony, the Amazon Time Sync Service 
with Cassandra, and/or a combination of them?
Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
Are there any chrony best-practices/custom settings for C* setups?

Thanks,
Kyrill



Re: Rocksandra blog post

2018-03-06 Thread Kyrylo Lebediev
Thanks for sharing, Dikang!

Impressive results.


As you plugged in a different storage engine, I'm curious how you're dealing 
with compactions in Rocksandra.

Is there still the concept of immutable SSTables + compaction strategies, or was 
it changed somehow?


Best,

Kyrill


From: Dikang Gu 
Sent: Monday, March 5, 2018 8:26 PM
To: d...@cassandra.apache.org; cassandra
Subject: Rocksandra blog post

As some of you already know, Instagram Cassandra team is working on the project 
to use RocksDB as Cassandra's storage engine.

Today, we just published a blog post about the work we have done, and more 
excitingly, we published the benchmark metrics in AWS environment.

Check it out here:
https://engineering.instagram.com/open-sourcing-a-10x-reduction-in-apache-cassandra-tail-latency-d64f86b43589

Thanks
Dikang



Re: How do counter updates work?

2018-03-05 Thread Kyrylo Lebediev
Hello!

Can't answer your question but there is another one: "why do we need to 
maintain counters with their known limitations (and I've heard of some issues 
with implementation of counters in Cassandra), when there exist really 
effective uuid generation algorithms which allow us to generate unique values?"
https://begriffs.com/posts/2018-01-01-sql-keys-in-depth.html

(the article is 
about keys in RDBMS's, but its statements are true for NoSQL as well)
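
For example, a time-based UUID can be generated on the server side (just a sketch):

CREATE TABLE events (
    id timeuuid PRIMARY KEY,
    payload text
);

INSERT INTO events (id, payload) VALUES (now(), 'example');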


Regards,
Kyrill


From: jpar...@gmail.com  on behalf of Javier Pareja 

Sent: Monday, March 5, 2018 1:55:14 PM
To: user@cassandra.apache.org
Subject: How do counter updates work?

Hello everyone,

I am trying to understand how cassandra counter writes work in more detail but 
all that I could find is this: 
https://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
From there I was able to extract the following process (flow diagram omitted):

PATH 1 will be much quicker than PATH 2 and its bottleneck (assuming HDD 
drives) will be the commitlog
PATH 2 will need at least an access to disk to do a read (potentially even in a 
different machine) and an access to disk to do a write to the commitlog. This 
is at least twice as slow as PATH 1.

This is all the info that I could get from the internet but a lot is missing. 
For example, there is no information about how the counter lock is acquired, is 
there a shared lock across all the nodes?

Hope I am not oversimplifying things, but I think this will be useful to better 
understand how to tune up the system.

Thanks in advance.

F Javier Pareja


Re: vnodes: high availability

2018-03-05 Thread Kyrylo Lebediev
What's the reason behind this negative effect of dynamic_snitch enabled?

Is this true for all C* versions for which this feature is implemented?

Is that because node latencies change too dynamically/sporadically, while the 
dynamic snitch's scores are updated more slowly 'than required' and can't keep up 
with these changes?

In case dynamic_snitch is disabled what algorithm is used to determine which 
replica should be read  (read requests, not digest requests)?


Regards,

Kyrill


From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad 
<j...@jonhaddad.com>
Sent: Thursday, January 18, 2018 12:49:02 AM
To: user
Subject: Re: vnodes: high availability

I *strongly* recommend disabling dynamic snitch.  I’ve seen it make latency 
jump 10x.

dynamic_snitch: false is your friend.



On Jan 17, 2018, at 2:00 PM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Avi,
If we prefer to have better balancing [like absence of hotspots during a node 
down event etc], large number of vnodes is a good solution.
Personally, I wouldn't prefer any balancing over overall resiliency  (and in 
case of non-optimal setup, larger number of nodes in a cluster decreases 
overall resiliency, as far as I understand.)

Talking about hotspots, there is a number of features helping to mitigate the 
issue, for example:
  - dynamic snitch [if a node overloaded it won't be queried]
  - throttling of streaming operations

Thanks,
Kyrill


From: Avi Kivity <a...@scylladb.com<mailto:a...@scylladb.com>>
Sent: Wednesday, January 17, 2018 2:50 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>; kurt greaves
Subject: Re: vnodes: high availability

On the flip side, a large number of vnodes is also beneficial. For example, if 
you add a node to a 20-node cluster with many vnodes, each existing node will 
contribute 5% of the data towards the new node, and all nodes will participate 
in streaming (meaning the impact on any single node will be limited, and 
completion time will be faster).

With a low number of vnodes, only a few nodes participate in streaming, which 
means that the cluster is left unbalanced and the impact on each streaming node 
is greater (or that completion time is slower).

Similarly, with a high number of vnodes, if a node is down its work is 
distributed equally among all nodes. With a low number of vnodes the cluster 
becomes unbalanced.

Overall I recommend high vnode count, and to limit the impact of failures in 
other ways (smaller number of large nodes vs. larger number of small nodes).

btw, rack-aware topology improves the multi-failure problem but at the cost of 
causing imbalance during maintenance operations. I recommend using rack-aware 
topology only if you really have racks with single-points-of-failure, not for 
other reasons.

On 01/17/2018 05:43 AM, kurt greaves wrote:
Even with a low amount of vnodes you're asking for a bad time. Even if you 
managed to get down to 2 vnodes per node, you're still likely to include double 
the amount of nodes in any streaming/repair operation which will likely be very 
problematic for incremental repairs, and you still won't be able to easily 
reason about which nodes are responsible for which token ranges. It's still 
quite likely that a loss of 2 nodes would mean some portion of the ring is down 
(at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens 
if you can; a lot of work still needs to be done to ensure smooth operation of 
C* while using vnodes, and they are much more difficult to reason about (which 
is probably the reason no one has bothered to do the math). If you're really 
keen on the math your best bet is to do it yourself, because it's not a point 
of interest for many C* devs plus probably a lot of us wouldn't remember enough 
math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a 
new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:
Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.
Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]


Regards,
Kyrill
[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA<https://issues.apache.org/jira/browse/CASSANDRA-7032>
issues.apache.org<http://issues.apache.org/>
It's been known for a little while that random vno

Re: Cassandra filter with ordering query modeling

2018-03-01 Thread Kyrylo Lebediev
Hi!

The partition key (Id in your case) must be in the WHERE clause if you're not using 
indexes (and indexes should be used carefully, not like in relational DBs). Also, 
only columns which belong to the primary key (= partition key + clustering key) can 
be used in WHERE in such cases. That's why the 2nd and 3rd queries are failing.
You might find this useful: 
http://cassandra.apache.org/doc/latest/cql/dml.html#the-where-clause
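
The usual approach is to denormalize and keep one table per query pattern. For 
query #2, for instance, a sketch could be (column types are guesses based on your 
statement above):

CREATE TABLE tablea_by_name (
    name text,
    starttime timestamp,
    id uuid,
    endtime timestamp,
    state text,
    PRIMARY KEY ((name), starttime, id)
) WITH CLUSTERING ORDER BY (starttime DESC, id ASC);

SELECT * FROM tablea_by_name WHERE name = 'test';

Query #3 would need a similar table partitioned by state (be careful: a value like 
'running' may become a very large, hot partition).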

There are several Cassandra handbooks available on Amazon; maybe it would be 
helpful for you to use some of them as a starting point to understand the aspects of 
Cassandra data/query modeling.


Regards,

Kyrill


From: Behroz Sikander 
Sent: Thursday, March 1, 2018 2:36:28 PM
To: user@cassandra.apache.org
Subject: Cassandra filter with ordering query modeling

Hi,

I am new to Cassandra and I am trying to model a table in Cassandra. My queries 
look like the following

Query #1: select * from TableA where Id = "123"
Query #2: select * from TableA where name="test" orderby startTime DESC
Query #3: select * from TableA where state="running" orderby startTime DESC

I have been able to build the table for Query #1 which looks like

val tableAStatement = SchemaBuilder.createTable("tableA").ifNotExists.
addPartitionKey(Id, DataType.uuid).
addColumn(Name, DataType.text).
addColumn(StartTime, DataType.timestamp).
addColumn(EndTime, DataType.timestamp).
addColumn(State, DataType.text)

session.execute(tableAStatement)

but for Query#2 and 3, I have tried many different things but failed. 
Everytime, I get stuck in a different error from cassandra.

Considering the above queries, what would be the right table model? What is the 
right way to model such queries.

Regards,
Behroz


Re: JMX metrics for CL

2018-03-01 Thread Kyrylo Lebediev
Thank you so much, Eric!



From: Eric Evans <john.eric.ev...@gmail.com>
Sent: Wednesday, February 28, 2018 6:26:23 PM
To: user@cassandra.apache.org
Subject: Re: JMX metrics for CL



On Tue, Feb 27, 2018 at 2:26 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Hello!


Is it possible to get counters  from C* side regarding CQL queries executed 
since startup for each CL?
For example:
CL ONE: NNN queries
CL QUORUM: MMM queries
etc

It's possible to get a count of client requests.  You want the count attribute 
of the client-request latency histogram (closest thing to documentation here: 
http://cassandra.apache.org/doc/latest/operating/metrics.html#client-request-metrics).
  For the scope, there are more request-types than are listed in the 
documentation; For each CL there is a read/write scope, ala 
(Read|Write)-(LOCAL-QUORUM,LOCAL-ONE,QUORUM,ONE,...).
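
If you want to read one of those counters programmatically, a minimal JMX sketch 
could look like this (the exact MBean scope spelling is an assumption -- verify it 
with jconsole against your version):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PerClRequestCount {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's default JMX port
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        // Assumed scope name -- check the exact spelling in jconsole
        ObjectName name = new ObjectName(
                "org.apache.cassandra.metrics:type=ClientRequest,scope=Read-LOCAL_QUORUM,name=Latency");
        Object count = mbs.getAttribute(name, "Count");
        System.out.println("Read-LOCAL_QUORUM requests since startup: " + count);
        connector.close();
    }
}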

An example dashboard:

https://grafana.wikimedia.org/dashboard/db/cassandra-client-request?panelId=1=1=eqiad%20prometheus%2Fservices=restbase=All=99p

Note: This graph suppresses series with a rate of zero for the graph time-span, 
so you won't see all possible request-types.



--
Eric Evans
john.eric.ev...@gmail.com<mailto:john.eric.ev...@gmail.com>


Re: JMX metrics for CL

2018-02-28 Thread Kyrylo Lebediev
Thanks, Horia!

I wouldn't like to introduce any changes in the source code.


Are there any alternatives for tracing the CLs used, from the C* side? If not 'since 
startup', then online metrics alone will be fine.


Regards,

Kyrill


From: Horia Mocioi <horia.moc...@ericsson.com>
Sent: Tuesday, February 27, 2018 7:38:23 PM
To: user@cassandra.apache.org
Subject: Re: JMX metrics for CL

Hello,

There are no such metrics that I am aware of.

One way you could do this is to have your own implementation of QueryHandler 
with your own metrics: filter the queries based on the CL and increment the 
corresponding metric.

Then, in cassandra-env.sh you could specify to use your class using  
-Dcassandra.custom_query_handler_class.

HTH,
Horia
On tis, 2018-02-27 at 08:26 +, Kyrylo Lebediev wrote:

Hello!


Is it possible to get counters  from C* side regarding CQL queries executed 
since startup for each CL?
For example:
CL ONE: NNN queries
CL QUORUM: MMM queries
etc

Regards,

Kyrill


Re: Cassandra on high performance machine: virtualization vs Docker

2018-02-28 Thread Kyrylo Lebediev
In terms of Cassandra, a rack is considered a single point of failure. So, using 
a rack-aware snitch (GossipingPropertyFileSnitch would be the best for your 
case), Cassandra won't place multiple replicas of the same range in the same 
rack.


Basically, there are two requirements that should be met:

1) The number of C* racks must not be less than the chosen RF [not less than 3 for RF=3]

2) [Recommended] The number of nodes should be [more or less] the same for all racks. 
In this case data/workload will be evenly balanced across all racks and nodes.
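
With GossipingPropertyFileSnitch, the rack/DC assignment is just a small per-node 
file; for example (values are only placeholders):

# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (same directory as cassandra.yaml);
# use a different rack per physical server
dc=DC1
rack=RACK1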


Having requirement #1 met, failure of a rack would cause just unavailability of 
one replica of 3 for some token ranges.

Of course, queries at consistency level ALL won't work in this case, but it's 
not typical to use CL=ALL with Cassandra. BTW, which CLs will you use?


In case you use Cassandra version >= 3.0, you may lower the number of vnodes per 
server to 32 [maybe even to 16]. This will reduce overhead for anti-entropy 
repairs, AFAIK. Reference: https://issues.apache.org/jira/browse/CASSANDRA-7032
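
On the new nodes that would be something like this in cassandra.yaml (the keyspace 
name is just a placeholder):

num_tokens: 16
allocate_tokens_for_keyspace: my_keyspace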


Kind Regards,

Kyrill


From: onmstester onmstester 
Sent: Wednesday, February 28, 2018 10:11:19 AM
To: user
Subject: Re: Cassandra on high performance machine: virtualization vs Docker

Thanks
Unfortunately yes! This is production, and that's the only hardware I have! I'm 
going to use ESX (I'm not worried about throughput overhead; stress tests show no 
problem with ESX; roughly, the throughput of Cassandra on a single physical server 
is less than that of 3 nodes on the same server).
If I use 3 nodes per physical server, with RF=3 and the nodes on each physical 
server configured to be in the same rack (Cassandra config), and one of my servers 
(with 3 nodes on it) fails, would data be lost? And would my application fail 
(using write consistency = 3 and read consistency = 1)?


Sent using Zoho Mail


 On Wed, 28 Feb 2018 08:13:01 +0330 daemeon reiydelle  
wrote 

Docker will provide less per node overhead.

And yes, virtualizing smaller nodes out of a bigger physical makes sense. Of 
course you lose the per node failure protection, but I guess this is not 
production?


<==>
"Who do you think made the first stone spear? The Asperger guy.
If you get rid of the autism genetics, there would be no Silicon Valley"
Temple Grandin
Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872

On Tue, Feb 27, 2018 at 8:26 PM, onmstester onmstester 
> wrote:


What I've got to set up my Apache Cassandra cluster are some servers with 20-core 
CPUs * 2 threads, 128 GB RAM and 8 * 2TB disks.
Having read all over the web "do not use big nodes for your cluster", I'm 
convinced to run multiple nodes on a single physical server.
So the question is which technology should i use: Docker or Virtualiztion 
(ESX)? Any exprience?

Sent using Zoho Mail






Re: Is it possible / makes it sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-24 Thread Kyrylo Lebediev
Oh, I'm sorry, that's my fault. Thank you for the correction, Jon. While it's 
possible in theory, I didn't figure out that the 256-token nodes will get relatively 
more and more data with each decommission of a 256-token node.

Regards,
Kyrill


From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad 
<j...@jonhaddad.com>
Sent: Saturday, February 24, 2018 5:44:24 PM
To: user@cassandra.apache.org
Subject: Re: Is it possible / makes it sense to limit concurrent streaming 
during bootstrapping new nodes?

You can’t migrate down that way.  The last several nodes you have up will get 
completely overwhelmed, and you’ll be completely screwed.  Please do not give 
advice like this unless you’ve actually gone through the process or at least 
have an understanding of how the data will be shifted.  Adding nodes with 16 
tokens while decommissioning the ones with 256 will be absolute hell.

You can only do this by adding a new DC and retiring the old.

On Feb 24, 2018, at 2:26 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

> By the way, is it possible to migrate towards smaller token ranges? What 
> is the recommended way of doing so?
 - I didn't see this question answered. I think the easiest way to do this is to 
add new C* nodes with fewer vnodes (8 or 16 instead of the default 256) and then 
decommission the old nodes with vnodes=256.

Thanks, guys, for shedding some light on this Java multithreading-related 
scalability issue. BTW, how can one tell from JVM / OS metrics that the number of 
threads in a JVM is becoming a bottleneck?

Also, I'd like to add a comment: the higher the number of vnodes per node, the 
lower the overall reliability of the cluster. Replicas for a token range are 
placed on the nodes responsible for the next+1, next+2 ranges (not taking into 
account NetworkTopologyStrategy / the snitch, which help, but seemingly not that 
much when expressed in terms of probabilities). The higher the number of vnodes 
per node, the higher the probability that all nodes in the cluster become 
'neighbors' in terms of token ranges.
The formula for the 'reliability' of a C* cluster is not trivial [I haven't had a 
chance to do the math], but in general, with a bigger number of nodes in a cluster 
(like 100 or 200), the probability that 2 or more nodes are down at the same time 
increases proportionally to the number of nodes.

The most reliable C* setup is using initial_token instead of vnodes.
But this makes manageability of the C* cluster worse [less automated + there will 
be hotspots in the cluster in some cases].

Remark: for a C* cluster with RF=3, any number of nodes and any initial_token/vnodes 
setup, there is always a possibility that simultaneous unavailability of 2 (or 3, 
depending on which CL is used) nodes will lead to unavailability of a token 
range ('HostUnavailable' exception).
No miracles: reliability is mostly determined by RF number.

The task which must be solved for large clusters: "Reliability of a cluster with 
NNN nodes and RF=3 shouldn't be 'tangibly' less than the reliability of a 3-node 
cluster with RF=3."

Kind Regards,
Kyrill

From: Jürgen Albersdorfer 
<jalbersdor...@gmail.com<mailto:jalbersdor...@gmail.com>>
Sent: Tuesday, February 20, 2018 10:34:21 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Is it possible / makes it sense to limit concurrent streaming 
during bootstrapping new nodes?

Thanks Jeff,
your answer is really not what I expected to learn - which is again more manual 
doing as soon as we start really using C*. But I‘m happy to be able to learn it 
now and have still time to learn the neccessary Skills and ask the right 
questions on how to correctly drive big data with C* until we actually start 
using it, and I‘m glad to have People like you around caring about this 
questions. Thanks. This still convinces me having bet on the right horse, even 
when it might become a rough ride.

By the way, is it possible to migrate towards smaller token ranges? What is 
the recommended way of doing so? And which number of nodes is the typical ‚break 
even‘?

Von meinem iPhone gesendet

Am 20.02.2018 um 21:05 schrieb Jeff Jirsa 
<jji...@gmail.com<mailto:jji...@gmail.com>>:

The scenario you describe is the typical point where people move away from 
vnodes and towards single-token-per-node (or a much smaller number of vnodes).

The default setting puts you in a situation where virtually all hosts are 
adjacent/neighbors to all others (at least until you're way into the hundreds 
of hosts), which means you'll stream from nearly all hosts. If you drop the 
number of vnodes from ~256 to ~4 or ~8 or ~16, you'll see the number of streams 
drop as well.

Many people with "large" clusters statically allocate tokens to make it 
predictable - if you have a single token per host, you can add multiple hosts 
at a time, each streaming from a small num

Re: Is it possible / makes it sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-24 Thread Kyrylo Lebediev
> By the way, is it possible to migrate towards smaller token ranges? What 
> is the recommended way of doing so?

 - I didn't see this question answered. I think the easiest way to do this is to 
add new C* nodes with fewer vnodes (8 or 16 instead of the default 256) and then 
decommission the old nodes with vnodes=256.


Thanks, guys, for shedding some light on this Java multithreading-related 
scalability issue. BTW, how can one tell from JVM / OS metrics that the number of 
threads in a JVM is becoming a bottleneck?

Also, I'd like to add a comment: the higher the number of vnodes per node, the 
lower the overall reliability of the cluster. Replicas for a token range are 
placed on the nodes responsible for the next+1, next+2 ranges (not taking into 
account NetworkTopologyStrategy / the snitch, which help, but seemingly not that 
much when expressed in terms of probabilities). The higher the number of vnodes 
per node, the higher the probability that all nodes in the cluster become 
'neighbors' in terms of token ranges.

The formula for the 'reliability' of a C* cluster is not trivial [I haven't had a 
chance to do the math], but in general, with a bigger number of nodes in a cluster 
(like 100 or 200), the probability that 2 or more nodes are down at the same time 
increases proportionally to the number of nodes.


The most reliable C* setup is using initial_token instead of vnodes.

But this makes manageability of the C* cluster worse [less automated + there will 
be hotspots in the cluster in some cases].


Remark: for a C* cluster with RF=3, any number of nodes and any initial_token/vnodes 
setup, there is always a possibility that simultaneous unavailability of 2 (or 3, 
depending on which CL is used) nodes will lead to unavailability of a token 
range ('HostUnavailable' exception).

No miracles: reliability is mostly determined by RF number.


The task which must be solved for large clusters: "Reliability of a cluster with 
NNN nodes and RF=3 shouldn't be 'tangibly' less than the reliability of a 3-node 
cluster with RF=3."


Kind Regards,

Kyrill


From: Jürgen Albersdorfer 
Sent: Tuesday, February 20, 2018 10:34:21 PM
To: user@cassandra.apache.org
Subject: Re: Is it possible / makes it sense to limit concurrent streaming 
during bootstrapping new nodes?

Thanks Jeff,
your answer is really not what I expected to learn - which is again more manual 
doing as soon as we start really using C*. But I‘m happy to be able to learn it 
now and have still time to learn the neccessary Skills and ask the right 
questions on how to correctly drive big data with C* until we actually start 
using it, and I‘m glad to have People like you around caring about this 
questions. Thanks. This still convinces me having bet on the right horse, even 
when it might become a rough ride.

By the way, is it possible to migrate towards smaller token ranges? What is 
the recommended way of doing so? And which number of nodes is the typical ‚break 
even‘?

Von meinem iPhone gesendet

Am 20.02.2018 um 21:05 schrieb Jeff Jirsa 
>:

The scenario you describe is the typical point where people move away from 
vnodes and towards single-token-per-node (or a much smaller number of vnodes).

The default setting puts you in a situation where virtually all hosts are 
adjacent/neighbors to all others (at least until you're way into the hundreds 
of hosts), which means you'll stream from nearly all hosts. If you drop the 
number of vnodes from ~256 to ~4 or ~8 or ~16, you'll see the number of streams 
drop as well.

Many people with "large" clusters statically allocate tokens to make it 
predictable - if you have a single token per host, you can add multiple hosts 
at a time, each streaming from a small number of neighbors, without overlap.

It takes a bit more tooling (or manual token calculation) outside of cassandra, 
but works well in practice for "large" clusters.




On Tue, Feb 20, 2018 at 4:42 AM, Jürgen Albersdorfer 
> wrote:
Hi, I'm wondering whether it is possible, and whether it would make sense, to limit 
concurrent streaming when joining a new node to the cluster.

I'm currently operating a 15-Node C* Cluster (V 3.11.1) and joining another 
Node every day.
The 'nodetool netstats' shows it always streams data from all other nodes.

How far will this scale? What happens when I have hundreds or even thousands 
of nodes?

Has anyone experience with such a Situation?

Thanks, and regards
Jürgen



Re: Missing 3.11.X cassandra debian packages

2018-02-24 Thread Kyrylo Lebediev
Yes, it's a 'known issue' with the style of Debian repos management.

Personally, I don't understand why all versions except the latest one are 
intentionally removed from the repo index.
If we had several versions of a package in the repo, 'apt-get install cassandra 
cassandra-tools' would still choose the latest one by default.

(You may experiment with 'apt-cache policy <package>')


This approach of keeping only the latest version forces end-users to create and 
maintain local Debian repos: if we already have a 3.11.1 cluster and would like to 
add a node there, we can't install C* 3.11.1 via apt-get from the official C* 
repo.


For example, the Datadog team keeps all versions of their packages in the repo index: 
https://s3.amazonaws.com/apt.datadoghq.com/dists/stable/main/binary-amd64/Packages,
 and in case some particular version of dd-agent is needed, users still have the 
ability to install it by specifying it explicitly, like this: 'apt-get install 
<package>=<version>', while 'apt-get install <package>' will install the latest version 
of the package.


Needless to say, in the case of C* the version does matter.


So, I'd say this 'feature' is a bad feature.


Regards,

Kyrill


From: Michael Shuler  on behalf of Michael Shuler 

Sent: Wednesday, February 21, 2018 8:18:21 PM
To: user@cassandra.apache.org
Subject: Re: Missing 3.11.X cassandra debian packages

On 02/21/2018 11:56 AM, Zachary Marois wrote:
> Starting in that last two weeks (I successfully installed cassandra
> sometime in the last two weeks), I'm guessing on 2/19 when version
> 3.11.2 was released, the cassandra apt package version 3.11.1 became
> unstable. It doesn't seem to be published in the
> http://www.apache.org/dist/cassandra/debian repository anymore (at least
> not in a valid state).
>
>
> Despite the package actually being in the repository still
>
> http://www.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_3.11.1_all.deb
>
> It is no longer in the Packages list

This is just the way reprepro works - only the latest version will be
reported.

This is a feature. Users really should be installing the latest release
version.

> http://dl.bintray.com/apache/cassandra/dists/311x/main/binary-amd64/Packages
>
> It looks like version 3.11.2 was released on 2/19.
>
> I'm guessing that publishing dropped the 3.11.1 version from the
> packages list.

This happens for every release. Every so often, branches will be culled
from http://www.apache.org/dist/, when they are no longer supported, so
periodically, complete series of packages will disappear. However, they
will always be available from the canonical Apache release repository.

The canonical release repository for Apache projects is
archive.apache.org. Every release artifact of the Apache Cassandra
project appears at:

  http://archive.apache.org/dist/cassandra/

A debian package user that cannot upgrade to the latest version via
apt-get can always use, for example, wget to fetch the .deb files they need
from the repo pool dir:

  http://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
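
For example, to pin 3.11.1 (fetch cassandra-tools the same way if you need it):

  wget http://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_3.11.1_all.deb
  sudo dpkg -i cassandra_3.11.1_all.deb
  # resolve any missing dependencies afterwards with: sudo apt-get -f install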

You will not find Cassandra 0.7.9 in the "current" apt repositories any
longer, since they are unsupported, but there are indeed people using
that version. The above spot is where to find the .deb packages for
0.7.9, and all older releases.

--
Warm regards,
Michael

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-20 Thread Kyrylo Lebediev
I'd say the "add new DC, then remove old DC" approach is riskier, especially if 
they use the QUORUM CL (in this case they will need to change the CL to LOCAL_QUORUM, 
otherwise they'll run into a lot of blocking read repairs).
Also, if there is a chance to avoid streaming, it is worth doing, as a direct 
data copy (not by means of C*) is usually more effective and less troublesome.

Regards,
Kyrill


From: Nitan Kainth <nitankai...@gmail.com>
Sent: Wednesday, February 21, 2018 1:04:05 AM
To: user@cassandra.apache.org
Subject: Re: Best approach to Replace existing 8 smaller nodes in production 
cluster with New 8 nodes that are bigger in capacity, without a downtime

You can also create a new DC and then terminate old one.

Sent from my iPhone

> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>
> Hi,
> Consider using this approach, replacing nodes one by one: 
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
>
> Regards,
> Kyrill
>
> 
> From: Leena Ghatpande <lghatpa...@hotmail.com>
> Sent: Tuesday, February 20, 2018 10:24:24 PM
> To: user@cassandra.apache.org
> Subject: Best approach to Replace existing 8 smaller nodes in production 
> cluster with New 8 nodes that are bigger in capacity, without a downtime
>
> Best approach to replace the existing 8 smaller nodes in a production cluster 
> with 8 new nodes that are bigger in capacity, without downtime
>
> We have 4 nodes each in 2 DCs, and we want to replace these 8 nodes with 8 new 
> nodes that are bigger in capacity in terms of RAM, CPU and disk space, without 
> downtime.
> The RF is set to 3 currently, and we have 2 large tables with up to 70 million 
> rows.
>
> What would be the best approach to implement this?
> - Add 1 new node and decommission 1 old node at a time?
> - Add all new nodes to the cluster, and then decommission the old nodes?
> If we do this, can we still keep RF=3 while we have 16 nodes at a 
> point in the cluster, before we start decommissioning?
> - How long do we wait between adding a node or decommissioning one, to ensure 
> the process is complete before we proceed?
> - Any tool that we can use to monitor whether the add/decommission of a node is 
> done before we proceed to the next?
>
> Any other suggestion?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-20 Thread Kyrylo Lebediev
Hi,
Consider using this approach, replacing nodes one by one: 
https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
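
Very roughly, the idea from that post is to reuse the old node's data and identity 
instead of streaming; a sketch only, see the post for the full procedure (paths and 
hostnames are examples):

# on the old node
nodetool drain
sudo service cassandra stop
rsync -aH /var/lib/cassandra/ new-node:/var/lib/cassandra/

# on the new node: same cassandra.yaml (cluster name, seeds, etc.);
# tokens and host id come along with the copied system keyspaces
sudo service cassandra start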

Regards,
Kyrill


From: Leena Ghatpande 
Sent: Tuesday, February 20, 2018 10:24:24 PM
To: user@cassandra.apache.org
Subject: Best approach to Replace existing 8 smaller nodes in production 
cluster with New 8 nodes that are bigger in capacity, without a downtime

Best approach to replace the existing 8 smaller nodes in a production cluster with 
8 new nodes that are bigger in capacity, without downtime

We have 4 nodes each in 2 DCs, and we want to replace these 8 nodes with 8 new 
nodes that are bigger in capacity in terms of RAM, CPU and disk space, without 
downtime.
The RF is set to 3 currently, and we have 2 large tables with up to 70 million 
rows.

What would be the best approach to implement this?
 - Add 1 new node and decommission 1 old node at a time?
 - Add all new nodes to the cluster, and then decommission the old nodes?
 If we do this, can we still keep RF=3 while we have 16 nodes at a 
point in the cluster, before we start decommissioning?
- How long do we wait between adding a node or decommissioning one, to ensure 
the process is complete before we proceed?
- Any tool that we can use to monitor whether the add/decommission of a node is done 
before we proceed to the next?

Any other suggestion?


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Kyrylo Lebediev
Agree with you, Daniel, regarding gaps in documentation.


---

At the same time I disagree with the folks who are complaining in this thread 
that some functionality, like 'advanced backup' etc., is missing out of the box.

We all live in a time when literally tons of open-source tools (automation, 
monitoring) and languages are available; there are also some really powerful SaaS 
solutions on the market which support C* (Datadog, for instance).


For example, while C* provides only basic building blocks for anti-entropy repairs 
[I mean, basic usage of 'nodetool repair' is not suitable for large production 
clusters], Reaper (many thanks to Spotify and TheLastPickle!), which uses this 
basic functionality, solves the task very well for real-world C* setups.


If something is missing or could be improved in your opinion, remember we're in the 
era of open source: create your own tool, let's say for C* backup automation using 
EBS snapshots, and upload it to GitHub.


C* is a DB engine, not a fully-automated self-contained suite.
End-users are able to work on automating the routine [3rd-party projects], 
while C* contributors may focus on core functionality.

--

Going back to the documentation topic: as far as I understand, DataStax is no 
longer the main C* contributor and is focused on its own C*-based proprietary software 
[somebody correct me if I'm wrong].

This has led us to a situation where development of C* is progressing (as far 
as I understand, work is done mainly by some large C* users with enough 
resources to contribute to the C* project to get the features they need), but 
there is no single company which has taken over keeping the C* 
documentation / wiki up to date.

Honestly, even DataStax's documentation is  too concise and  is missing a lot 
of important details.

[BTW, I've just taken a look at https://cassandra.apache.org/doc/latest/ and it 
doesn't look that 'bad': despite the TODOs it contains a lot of valuable 
information]


So, I feel the C* community has to join efforts on enriching the existing 
documentation / resurrecting the wiki [where howtos, information about 3rd-party 
automations and integrations, etc. can be placed].

By the community I mean all of us, including myself.



Regards,

Kyrill


From: Daniel Hölbling-Inzko 
Sent: Tuesday, February 20, 2018 11:28:13 AM
To: user@cassandra.apache.org; James Briggs
Cc: d...@cassandra.apache.org
Subject: Re: Cassandra Needs to Grow Up by Version Five!

Hi,

I have to add my own two cents here as the main thing that keeps me from really 
running Cassandra is the amount of pain running it incurs.
Not so much because it's actually painful but because the tools are so 
different and the documentation and best practices are scattered across a dozen 
outdated DataStax articles and this mailing list etc.. We've been hesitant 
(although our use case is perfect for using Cassandra) to deploy Cassandra to 
any critical systems as even after a year of running it we still don't have the 
operational experience to confidently run critical systems with it.

Simple things like a foolproof / safe cluster-wide S3 Backup (like 
Elasticsearch has it) would for example solve a TON of issues for new people. I 
don't need it auto-scheduled or something, but having to configure cron jobs 
across the whole cluster is a pain in the ass for small teams.
To be honest, even the way snapshots are done right now is already super 
painful. Every other system I operated so far will just create one backup 
folder I can export; in C* the backup is scattered across a bunch of different 
keyspace folders etc. Needless to say, it took a while until I trusted my 
backup scripts fully.
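
For what it's worth, the core of such a backup script is not huge - the painful 
part is exactly that it has to be orchestrated on every node. A minimal sketch, 
assuming the default data directory, GNU tar, and made-up tag/target names:

  nodetool snapshot -t nightly
  find /var/lib/cassandra/data -type d -path '*/snapshots/nightly' -print0 \
    | tar czf /backups/$(hostname)-nightly.tar.gz --null -T -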

And especially for a Database I believe Backup/Restore needs to be a non-issue 
that's documented front and center. If not smaller teams just don't have the 
resources to dedicate to learning and building the tools around it.

Now that the team is getting larger we could spare the resources to operate 
these things, but switching from a well-understood RDBMS schema to Cassandra is 
now incredibly hard and will probably take years.

greetings Daniel

On Tue, 20 Feb 2018 at 05:56 James Briggs  
wrote:
Kenneth:

What you said is not wrong.

Vertica and Riak are examples of distributed databases that don't require 
hand-holding.

Cassandra is for Java-programmer DIYers, or more often Datastax clients, at 
this point.
Thanks, James.


From: Kenneth Brotman 
To: user@cassandra.apache.org
Cc: d...@cassandra.apache.org
Sent: Monday, February 19, 2018 4:56 PM

Subject: RE: Cassandra Needs to Grow Up by Version Five!

Jeff, you helped me figure out what I was missing.  It just took me a day to 
digest what you wrote.  I’m coming over from another type of engineering.  I 
didn’t know and it’s not 

Re: Hints folder missing in Cassandra

2018-02-07 Thread Kyrylo Lebediev
Hi,

Not sure what the reason could be, but the issue is obvious: the directory 
/var/lib/cassandra/hints is missing, or the cassandra user doesn't have enough 
permissions on it.
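
If the directory is simply gone, recreating it with the right ownership should 
be enough (a minimal sketch, assuming the default path and that C* runs as the 
cassandra user/group):

  sudo mkdir -p /var/lib/cassandra/hints
  sudo chown cassandra:cassandra /var/lib/cassandra/hints
  sudo chmod 750 /var/lib/cassandra/hints

Then restart the node and check whether the warning is gone.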


Regards,

Kyrill


From: test user 
Sent: Wednesday, February 7, 2018 8:14:59 AM
To: user@cassandra.apache.org
Subject: Fwd: Hints folder missing in Cassandra


Has anyone run into the issue mentioned below?


-- Forwarded message --
From: test user >
Date: Tue, Feb 6, 2018 at 3:28 PM
Subject: Hints folder missing in Cassandra
To: 
user-subscr...@cassandra.apache.org



Hello All,



I am using Cassandra 3.10. I have 3-node cluster (say node0, node1, node2).

I have been using nodetool utility to monitor the cluster.

nodetool status – indicates all nodes are up and in normal state

nodetool ring – indicates all nodes are up



Running into following issue:

WARN  [HintsWriteExecutor:1] 2018-02-05 06:57:09,607 CLibrary.java:295 - 
open(/var/lib/cassandra/hints, O_RDONLY) failed, errno (2).

For some reason, the hints folder on node1 is missing. Not sure how it got 
deleted. One node in the cluster always seems to run into this issue.

The other two nodes node0 and node2 do not have this issue, and the hints 
folder is present.



The debug logs do not seem to indicate the root cause.



Please let me know how to debug this further to understand the root cause.



Regards,

Cassandra User





Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Kyrylo Lebediev
I've found a modified version of Carlos' article (more recent than the one I was 
referring to) and this one contains the same method as you described, Oleksandr:

https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement


Thank you for your readiness to help!

Kind Regards,

Kyrill


From: Kyrylo Lebediev <kyrylo_lebed...@epam.com>
Sent: Saturday, February 3, 2018 12:23:15 PM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming


Thank you Oleksandr,

Just tested on 3.11.1 and it worked for me (you may see the logs below).

Just realized that there is one important prerequisite for this method to work: 
the new node MUST be located in the same rack (in terms of C*) as the old one. 
Otherwise the correct replica placement order will be violated (I mean the case 
when replicas of the same token range should be placed in different racks).

Anyway, even after a successful run of node replacement in a sandbox, I'm still 
in doubt.

I'm just wondering why this procedure, which seems to be much easier than 
[add/remove node] or [replace a node], the documented ways of live node 
replacement, has never been included in the documentation.

Does anybody in the ML know the reason for this?


Also, for some reason in his article Carlos drops the files of the system 
keyspace (which contains the system.local table):

In the new node, delete all system tables except for the schema ones. This will 
ensure that the new Cassandra node will not have any corrupt or previous 
configuration assigned.

  1.  sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | 
xargs -I {} sudo rm -rf {}


http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
[Carlos, if you are here, could you please comment?]

So still a mystery to me.

-
Logs for 3.11.1

-

== Before:

--  Address   Load   Tokens   Owns (effective)  Host ID 
  Rack
UN  10.10.10.222  256.61 KiB  3100.0%
bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.223  225.65 KiB  3100.0%
c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1 <<<<<<
UN  10.10.10.221  187.39 KiB  3100.0%
d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1

=== After:
--  Address   Load   Tokens   Owns (effective)  Host ID 
  Rack
UN  10.10.10.222  245.84 KiB  3100.0%
bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.221  192.8 KiB  3100.0%
d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
UN  10.10.10.224  266.61 KiB  3100.0%
c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<



== Logs from another node (10.10.10.221):
INFO  [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 
OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
INFO  [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node 
/10.10.10.224 is now part of the cluster
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
WARN  [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host 
ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and 
/10.10.10.224; /10.10.10.224 is the new owner
INFO  [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient 
/10.10.10.223 has been silent for 3ms, removing from gossip

== Logs from new node:
INFO  [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish 
joining ring
INFO  [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node 
/10.10.10.223 is now part of the cluster
WARN  [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not 
updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because 
it's mine
INFO  [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 51879124242594885.  
Ignoring /10.10.10.223
WARN  [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage 
has 5 pending tasks; skipping status check (no nodes will be marked down)
INFO  [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing 
pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
WARN  [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage 
has 7 pending tasks; skipping status check (no 

Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Kyrylo Lebediev
.10.222 is now DOWN<<<<< have no idea why this appeared in logs
INFO  [main] 2018-02-03 11:33:20,566 NativeTransportService.java:70 - Netty 
using native Epoll event loop
INFO  [HANDSHAKE-/10.10.10.222] 2018-02-03 11:33:20,714 
OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.222




Kind Regards,
Kyrill



From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Sent: Saturday, February 3, 2018 10:44:26 AM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming

On 3 Feb 2018 08:49, "Jürgen Albersdorfer" 
<jalbersdor...@gmail.com<mailto:jalbersdor...@gmail.com>> wrote:
Cool, good to know. Do you know this is still true for 3.11.1?

Well, I've never tried with that specific version, but this is pretty 
fundamental, so I would expect it to work the same way. Test in isolation if 
you want to be sure, though.

I don't think this is documented anywhere, however, since I had the same doubts 
before seeing it worked for the first time.

--
Alex

Am 03.02.2018 um 08:19 schrieb Oleksandr Shulgin 
<oleksandr.shul...@zalando.de<mailto:oleksandr.shul...@zalando.de>>:

On 3 Feb 2018 02:42, "Kyrylo Lebediev" 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thanks, Oleksandr,
In my case I'll need to replace all nodes in the cluster (one-by-one), so 
streaming will introduce perceptible overhead.
My question is not about data movement/copy itself, but more about all this 
token magic.

Okay, let's say we stopped old node, moved data to new node.
Once it's started with auto_bootstrap=false it will be added to the cluster 
like an usual node, just skipping streaming stage, right?

For a cluster with vnodes enabled, during addition of new node its token ranges 
are calculated automatically by C* on startup.

So, how will C* know that this new node must be responsible for exactly the 
same token ranges as the old node was?
How would the rest of nodes in the cluster ('peers') figure out that old node 
should be replaced in ring by the new one?

Do you know about some  limitation for this process in case of C* 2.1.x with 
vnodes enabled?

A node stores its tokens and host id in the system.local table. Next time it 
starts up, it will use the same tokens as previously and the host id allows the 
rest of the cluster to see that it is the same node and ignore the IP address 
change. This happens regardless of auto_bootstrap setting.

Try "select * from system.local" to see what is recorded for the old node. When 
the new node starts up it should log "Using saved tokens" with the list of 
numbers. Other nodes should log something like "ignoring IP address change" for 
the affected node addresses.

Be careful though, to make sure that you put the data directory exactly where 
the new node expects to find it: otherwise it might just join as a brand new 
one, allocating new tokens. As a precaution it helps to ensure that the system 
user running the Cassandra process has no permission to create the data 
directory: this should stop the startup in case of misconfiguration.
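
A quick way to check both things (a sketch, assuming cqlsh is available locally 
and the default /var/lib/cassandra layout - adjust paths to your config):

  # on the old node, capture the tokens and host id it is using:
  cqlsh -e "SELECT host_id, tokens FROM system.local;"

  # on the new node, make the parent of the data directories owned by root, so a
  # misconfigured path fails fast instead of bootstrapping with fresh tokens:
  sudo chown root:root /var/lib/cassandra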

Cheers,
--
Alex




Re: Cassandra 2.1: replace running node without streaming

2018-02-02 Thread Kyrylo Lebediev
Thanks, Oleksandr,
In my case I'll need to replace all nodes in the cluster (one-by-one), so 
streaming will introduce perceptible overhead.
My question is not about data movement/copy itself, but more about all this 
token magic.

Okay, let's say we stopped old node, moved data to new node.
Once it's started with auto_bootstrap=false it will be added to the cluster 
like an usual node, just skipping streaming stage, right?

For a cluster with vnodes enabled, during addition of new node its token ranges 
are calculated automatically by C* on startup.

So, how will C* know that this new node must be responsible for exactly the 
same token ranges as the old node was?
How would the rest of nodes in the cluster ('peers') figure out that old node 
should be replaced in ring by the new one?

Do you know about some  limitation for this process in case of C* 2.1.x with 
vnodes enabled?

Regards,
Kyrill


From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Sent: Friday, February 2, 2018 4:26:30 PM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming

On Fri, Feb 2, 2018 at 3:15 PM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Hello All!

I've got a pretty standard task - to replace a running C* node [version 2.1.15, 
vnodes=256, Ec2Snitch] (IP address will change after replacement, have no 
control over it).

There are 2 ways stated in the C* documentation for how this can be done:

1) Add a new node, then 'nodetool decommission' [ = 2 data streaming + 2 token 
range recalculations],

2) Stop the node then replace it by setting -Dcassandra.replace_address [ = 1 
data streaming]
Unfortunately, both these methods imply data streaming.

Is there a supported solution how to replace a live healthy node without data 
streaming / bootstrapping?
Something like: "Stop old node, copy data to new node, start new node with 
auto_bootstrap=false etc..."

On EC2, if you're using EBS it's pretty easy: drain and stop the old node, 
attach the volume to the new one and start it.
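
For the EBS variant the whole swap is roughly the following (a sketch with 
made-up volume/instance IDs and device name, assuming the AWS CLI and that the 
data directories live on that volume):

  nodetool drain                         # flush memtables, stop accepting writes
  sudo service cassandra stop
  aws ec2 detach-volume --volume-id vol-0123456789abcdef0
  aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
      --instance-id i-0fedcba9876543210 --device /dev/xvdf
  # mount it at the same path on the new node, then start Cassandra there
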
If not using EBS, then you have to copy the data to the new node before it is 
started.


I was able to find a couple manuals on the Internet, like this one: 
http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/, but 
not having understanding of C* internals, I don't know if such hacks are safe.

More or less like that: rsync while the old node is still running, then stop 
the node and rsync again.
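
I.e. something along these lines (a sketch, assuming the default data directory 
and ssh access between the two boxes):

  # first pass while the old node is still serving traffic:
  rsync -a /var/lib/cassandra/ newnode:/var/lib/cassandra/
  # then 'nodetool drain', stop C* on the old node, and a much shorter second pass:
  rsync -a --delete /var/lib/cassandra/ newnode:/var/lib/cassandra/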

But given all the hassle, streaming with replace_address doesn't sound too 
costly to me.

Cheers,
--
Alex



Cassandra 2.1: replace running node without streaming

2018-02-02 Thread Kyrylo Lebediev
Hello All!

I've got a pretty standard task - to replace a running C* node [version 2.1.15, 
vnodes=256, Ec2Snitch] (IP address will change after replacement, have no 
control over it).

There are 2 ways stated in the C* documentation for how this can be done:

1) Add a new node, then 'nodetool decommission' [ = 2 data streaming + 2 token 
range recalculations],

2) Stop the node then replace it by setting -Dcassandra.replace_address [ = 1 
data streaming]
Unfortunately, both these methods imply data streaming.

Is there a supported solution how to replace a live healthy node without data 
streaming / bootstrapping?
Something like: "Stop old node, copy data to new node, start new node with 
auto_bootstrap=false etc..."

I was able to find a couple manuals on the Internet, like this one: 
http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/, but 
not having understanding of C* internals, I don't know if such hacks are safe.


Regards,

Kyrill



Re: Unable to change IP address from private to public

2018-01-20 Thread Kyrylo Lebediev
I haven't tried this myself, but I suspect that in order to add public 
addresses to existing nodes for inter-DC communication, the following method 
might work (just my guess):

 - take a node, shut down C* there, change its private IP, add a public IP, 
update cassandra.yaml accordingly (the same way as you described, plus set the 
new private IP) and replace the 'old node' as described here: 
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/operations/opsReplaceNode.html
 - do the same for the rest of the nodes, one by one.
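
For reference, the 'replace' step in that doc boils down to starting the new 
node with the replace_address flag set. A minimal sketch (the old-node IP below 
is made up):

  # in cassandra-env.sh on the replacement node, before its very first start:
  JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.1.12"
  # remove the flag again once the node has finished joining the ring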


But it's better to test this approach somewhere in sandbox.

Also, AFAIK, only the *listen* parameters influence how C* builds its ring. 
The *rpc* settings are for communication between clients and Cassandra.


Regards,

Kyrill

Replacing a dead node or dead seed 
node
docs.datastax.com
Steps to replace a node that has died for some reason, such as hardware failure.




From: Thomas Goossens 
Sent: Friday, January 19, 2018 2:18:14 PM
To: user@cassandra.apache.org
Subject: Unable to change IP address from private to public

Hello,

For an 8-node Cassandra 2.2.11 cluster which is spread over two datacenters, I 
am looking to change from using private IP addresses to the combination of 
private and public IP addresses and interfaces. The cluster uses the 
GossipingPropertyFileSnitch.

The aim is to end up with each peer being known by their public IP address, but 
communicating over the private network where possible. We are looking to make 
this change in rolling fashion, so without taking down the entire cluster.

To this end, I have made the following configuration changes:

  1.  cassandra-rackdc.properties: add the "prefer_local=true" option
  2.  cassandra.yaml:
- change the broadcast_address from the private address to the public address
- add "listen_on_broadcast_address: true"
- leave the "listen_address" to be the private IP address
- change the broadcast_rpc_address from the private address to the public 
address
  3.  infrastructure:
- amend firewall settings to allow TCP traffic between peers on all interfaces 
for the relevant ports

Upon trying to make the described changes on a non-seed node, the following 
happens:
- The node appears to start up normally
- Upon running nodetool status on the changed node, all peers appear to be 
down, except the local node:

Datacenter: DC-A
=
DN  10.x.x.x
DN  10.x.x.x
DN  10.x.x.x
DN  10.x.x.x
DN  10.x.x.x

Datacenter: DC-B
=
DN  10.x.x.x
DN  10.x.x.x
UN  123.x.x.x

- Upon running nodetool status on any other node, all peers appear to be up, 
except the changed node which is still known under its old IP address:

Datacenter: DC-A
=
UN  10.x.x.x
UN  10.x.x.x
UN  10.x.x.x
UN  10.x.x.x
UN  10.x.x.x

Datacenter: DC-B
=
UN  10.x.x.x
UN  10.x.x.x
DN  10.x.x.x

Reverting the changes in cassandra.yaml and restarting that node causes the 
cluster to go back to normal. I have tried various combinations of 
private/public IP address settings, all to no avail.

I have successfully set up a similar configuration in a test cluster, but I 
have had to bring down the entire cluster in order to get it to work.

My question is: is it possible to make such a change in a phased way without 
bringing down the entire cluster? If yes, what is the best approach?

Many thanks in advance.

Thomas



--



Thomas Goossens
CTO


[Drillster]

Rijnzathe 16
3454 PV De Meern
Netherlands
+31 88 375 0500
www.drillster.com





Re: vnodes: high availability

2018-01-17 Thread Kyrylo Lebediev
Avi,

If we prefer to have better balancing [like the absence of hotspots during a 
node-down event etc.], a large number of vnodes is a good solution.

Personally, I wouldn't prefer any balancing over overall resiliency (and in 
case of a non-optimal setup, a larger number of nodes in a cluster decreases 
overall resiliency, as far as I understand).


Talking about hotspots, there are a number of features that help to mitigate 
the issue, for example:

  - dynamic snitch [if a node is overloaded it won't be queried]

  - throttling of streaming operations
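    (the latter can also be adjusted at runtime; the value below is just an 
    illustration)

      nodetool setstreamthroughput 100   # cap outbound streaming, in Mb/s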

Thanks,
Kyrill


From: Avi Kivity <a...@scylladb.com>
Sent: Wednesday, January 17, 2018 2:50 PM
To: user@cassandra.apache.org; kurt greaves
Subject: Re: vnodes: high availability


On the flip side, a large number of vnodes is also beneficial. For example, if 
you add a node to a 20-node cluster with many vnodes, each existing node will 
contribute 5% of the data towards the new node, and all nodes will participate 
in streaming (meaning the impact on any single node will be limited, and 
completion time will be faster).


With a low number of vnodes, only a few nodes participate in streaming, which 
means that the cluster is left unbalanced and the impact on each streaming node 
is greater (or that completion time is slower).


Similarly, with a high number of vnodes, if a node is down its work is 
distributed equally among all nodes. With a low number of vnodes the cluster 
becomes unbalanced.


Overall I recommend high vnode count, and to limit the impact of failures in 
other ways (smaller number of large nodes vs. larger number of small nodes).


btw, rack-aware topology improves the multi-failure problem but at the cost of 
causing imbalance during maintenance operations. I recommend using rack-aware 
topology only if you really have racks with single-points-of-failure, not for 
other reasons.

On 01/17/2018 05:43 AM, kurt greaves wrote:
Even with a low amount of vnodes you're asking for a bad time. Even if you 
managed to get down to 2 vnodes per node, you're still likely to include double 
the amount of nodes in any streaming/repair operation which will likely be very 
problematic for incremental repairs, and you still won't be able to easily 
reason about which nodes are responsible for which token ranges. It's still 
quite likely that a loss of 2 nodes would mean some portion of the ring is down 
(at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens 
if you can; a lot of work still needs to be done to ensure smooth operation of 
C* while using vnodes, and they are much more difficult to reason about (which 
is probably the reason no one has bothered to do the math). If you're really 
keen on the math your best bet is to do it yourself, because it's not a point 
of interest for many C* devs plus probably a lot of us wouldn't remember enough 
math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a 
new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.

Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]



Regards,

Kyrill

[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA<https://issues.apache.org/jira/browse/CASSANDRA-7032>
issues.apache.org<http://issues.apache.org>
It's been known for a little while that random vnode allocation causes hotspots 
of ownership. It should be possible to improve dramatically on this with 
deterministic ...



From: Jon Haddad <jonathan.had...@gmail.com<mailto:jonathan.had...@gmail.com>> 
on behalf of Jon Haddad <j...@jonhaddad.com<mailto:j...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 8:21:33 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately f

Re: vnodes: high availability

2018-01-17 Thread Kyrylo Lebediev
Kurt, thanks for your recommendations.

Make sense.


Yes, we're planning to migrate the cluster and change the endpoint_snitch to an 
"AZ-aware" one.

Unfortunately, I'm 'not good enough' in math, have to think of how to calculate 
probabilities for the case of vnodes (whereas the case "without vnodes" should 
be easy to calculate: just a bit of combinatorics). Not an easy task for me, 
but will try to get at least some estimations.

I still believe that, having formulas (the results of doing the math), we could 
come up with 'better' best practices than those currently stated in the C* 
documentation.


--

In particular, as far as I understand, the probability of losing a keyrange [for 
CL=QUORUM] for a cluster with vnodes=256 and SimpleSnitch and a total number of 
physical nodes not much more than 256 [not a lot of such large clusters..] 
equals:

P1 = C(Nnodes, 2) * p^2 = Nnodes*(Nnodes-1)/2 * p^2


where:

p - failure probability of a single server,

C(Nnodes, 2) - the number of ways to choose any 2 nodes out of Nnodes  
[https://en.wikipedia.org/wiki/Combination]


Whereas the probability of losing a keyrange for a non-vnode cluster is:

P2 = 2*Nnodes*p^2


So, the 'good old' non-vnodes cluster is more reliable than the 'new-style' 
vnodes cluster.

Correct?
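
To get a feel for the difference, a quick back-of-the-envelope with assumed 
numbers (Nnodes=50, p=0.01, i.e. a 1% chance of any given server being down):

P1 = 50*49/2 * 0.01^2 = 1225 * 0.0001 = 0.1225  (about 12%)
P2 = 2*50 * 0.01^2 = 100 * 0.0001 = 0.01  (about 1%)

So with these made-up numbers the vnodes cluster would be roughly an order of 
magnitude more likely to lose QUORUM for some keyrange.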


I would like to get similar results for more realistic cases. I will be back 
here once I get them (hopefully).


Regards,

Kyrill


Combination - Wikipedia<https://en.wikipedia.org/wiki/Combination>
en.wikipedia.org
In mathematics, a combination is a selection of items from a collection, such 
that (unlike permutations) the order of selection does not matter.







From: kurt greaves <k...@instaclustr.com>
Sent: Wednesday, January 17, 2018 5:43:06 AM
To: User
Subject: Re: vnodes: high availability

Even with a low amount of vnodes you're asking for a bad time. Even if you 
managed to get down to 2 vnodes per node, you're still likely to include double 
the amount of nodes in any streaming/repair operation which will likely be very 
problematic for incremental repairs, and you still won't be able to easily 
reason about which nodes are responsible for which token ranges. It's still 
quite likely that a loss of 2 nodes would mean some portion of the ring is down 
(at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens 
if you can; a lot of work still needs to be done to ensure smooth operation of 
C* while using vnodes, and they are much more difficult to reason about (which 
is probably the reason no one has bothered to do the math). If you're really 
keen on the math your best bet is to do it yourself, because it's not a point 
of interest for many C* devs plus probably a lot of us wouldn't remember enough 
math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a 
new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.

Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]



Regards,

Kyrill

[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA<https://issues.apache.org/jira/browse/CASSANDRA-7032>
issues.apache.org<http://issues.apache.org>
It's been known for a little while that random vnode allocation causes hotspots 
of ownership. It should be possible to improve dramatically on this with 
deterministic ...



From: Jon Haddad <jonathan.had...@gmail.com<mailto:jonathan.had...@gmail.com>> 
on behalf of Jon Haddad <j...@jonhaddad.com<mailto:j...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 8:21:33 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.

Below in the thread Alex mentioned that it'

Re: question about nodetool decommission

2018-01-17 Thread Kyrylo Lebediev
Hi Jerome,

I don't know the reason for this, but compactions do run during 'nodetool 
decommission'.


What C* version are you working with?

What is the reason you're decommissioning the node? Any issues with it?

Can you see any errors/warnings in system.log on the node being decommissioned?

Pending tasks --- do you mean the output of 'nodetool tpstats'?

Could you please send the output of 'nodetool netstats' from the node you're 
trying to evict from the cluster?


Regards,

Kyrill



From: Jerome Basa 
Sent: Wednesday, January 17, 2018 6:56:10 PM
To: user@cassandra.apache.org
Subject: question about nodetool decommission

hi,

am currently decommissioning a node and monitoring it using `nodetool
netstats`. i’ve noticed that it hasn’t started streaming any data and
it’s doing compaction (like 600+ pending tasks). the node is marked as
“UL” when i run `nodetool status`.

has anyone seen like this before? am thinking of stopping C* and run
`nodetool removenode`. also, can i add a new node to the cluster while
one node is marked as “UL” (decommissioning)? thanks

regards,
-jerome




Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.

Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]



Regards,

Kyrill

[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA<https://issues.apache.org/jira/browse/CASSANDRA-7032>
issues.apache.org
It's been known for a little while that random vnode allocation causes hotspots 
of ownership. It should be possible to improve dramatically on this with 
deterministic ...



From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad 
<j...@jonhaddad.com>
Sent: Tuesday, January 16, 2018 8:21:33 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a 
value lower than 256 only for C* version > 3.0 (token allocation algorithm was 
improved since C* 3.0) .

Jon,
Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
2.1?

vnodes=32 is also too high, as for me (we need to have much more than 32 servers 
per AZ in order to get a 'reliable' cluster)
vnodes=4 seems to be better from the HA + balancing trade-off

Thanks,
Kyrill

From: Jon Haddad <jonathan.had...@gmail.com<mailto:jonathan.had...@gmail.com>> 
on behalf of Jon Haddad <j...@jonhaddad.com<mailto:j...@jonhaddad.com>>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.
In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.

Regards,
Kyrill

From: kurt greaves <k...@instaclustr.com<mailto:k...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unf

Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
Agree with you, Jon.

Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)

We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.


Below in the thread Alex mentioned that it's recommended to set vnodes to a 
value lower than 256 only for C* version > 3.0 (token allocation algorithm was 
improved since C* 3.0) .


Jon,

Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
2.1?


vnodes=32 is also too high, as for me (we need to have much more than 32 servers 
per AZ in order to get a 'reliable' cluster)

vnodes=4 seems to be better from the HA + balancing trade-off


Thanks,

Kyrill


From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad 
<j...@jonhaddad.com>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.
In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.

Regards,
Kyrill

From: kurt greaves <k...@instaclustr.com<mailto:k...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (at least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill


From: Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is

Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
 
loss of QUORUM : can you afford eventual consistency ? Is it better to endure 
full downtime or just on a subset of your partitions ?
And can you design your cluster with racks/datacenters so that you can better 
predict how to run maintenance operations or if you may be losing QUORUM ?

The way Cassandra is designed also allows linear scalability, which 
master/slave based databases cannot handle (and master/slave architectures come 
with their set of challenges, especially during network partitions).

So, while the high availability isn't as transparent as one might think (and I 
understand why you may be disappointed), you have a lot of options on how to 
react to partial downtime, and that's something you must consider both when 
designing your cluster (how it is segmented, how operations are performed), and 
when designing your apps (how you will use the driver, how your apps will react 
to failure).

Cheers,


On Tue, Jan 16, 2018 at 11:03 AM Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.

More nodes in a cluster means higher probability of simultaneous node failures.

And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.


Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.

In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.


Regards,

Kyrill


From: kurt greaves <k...@instaclustr.com<mailto:k...@instaclustr.com>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User

Subject: Re: vnodes: high availability
Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thanks Alexander!


I'm not a MS in math too) Unfortunately.


Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (at least it's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill



From: Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be p

Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.

More nodes in a cluster means higher probability of simultaneous node failures.

And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.


Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.

In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.


Regards,

Kyrill


From: kurt greaves <k...@instaclustr.com>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thanks Alexander!


I'm not a MS in math too) Unfortunately.


Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (at least it's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill



From: Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to 
exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to 
have a vnode replicated on node A than the opposite), that doubles the odds of 
2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more 
nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that 
area...)

How many queries that will affect is a different question as it depends on 
which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of 
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a 
single rack, you will still have two other replicas available to satisfy quorum 
in the remaining ra

Re: vnodes: high availability

2018-01-15 Thread Kyrylo Lebediev
Thanks Alexander!


I'm not a MS in math too) Unfortunately.


Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (at least it's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?


Regards,

Kyrill



From: Alexander Dejanovski <a...@thelastpickle.com>
Sent: Monday, January 15, 2018 8:14:21 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski 
<a...@thelastpickle.com<mailto:a...@thelastpickle.com>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to 
exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to 
have a vnode replicated on node A than the opposite), that doubles the odds of 
2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more 
nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that 
area...)

How many queries that will affect is a different question as it depends on 
which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of 
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a 
single rack, you will still have two other replicas available to satisfy quorum 
in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but 
the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as 
you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Thanks, Rahul.

But in your example, the simultaneous loss of Node3 and Node6 leads to the loss 
of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and 
endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible 
for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in case of my real-world ~50-nodes cluster with vnodes=256, RF=3 for 
this command:

nodetool ring | grep '^' | awk '{print $1}' | uniq | grep -B2 -A2 
'' | grep -v '' | grep -v '^--' | sort | uniq | wc 
-l

returned a number which equals Nnodes - 1, which means that I can't switch off 2 
nodes at the same time w/o losing some keyrange for CL=QUORUM.


Thanks,

Kyrill


From: Rahul Neelakantan <ra...@rahul.be<mailto:ra...@rahul.be>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned 
to them. For example take a look at

Re: vnodes: high availability

2018-01-15 Thread Kyrylo Lebediev
Thanks, Rahul.

But in your example, the simultaneous loss of Node3 and Node6 leads to the loss 
of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and 
endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible 
for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is it correct?

At least in case of my real-world ~50-nodes cluster with vnodes=256, RF=3 for 
this command:

nodetool ring | grep '^' | awk '{print $1}' | uniq | grep -B2 -A2 
'' | grep -v '' | grep -v '^--' | sort | uniq | wc 
-l

returned a number which equals Nnodes - 1, which means that I can't switch off 2 
nodes at the same time w/o losing some keyrange for CL=QUORUM.


Thanks,

Kyrill


From: Rahul Neelakantan <ra...@rahul.be>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned 
to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, 
will still not have any effect on Token Range A. But yes if you lose two nodes 
that both have Token Range A assigned to them (Say Node 1 and Node 2), you will 
have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having 
the client recognize a degraded cluster and operate temporarily in downgraded 
consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev 
<kyrylo_lebed...@epam.com<mailto:kyrylo_lebed...@epam.com>> wrote:

Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at 
CL=QUORUM?


Regards,

Kyrill



vnodes: high availability

2018-01-15 Thread Kyrylo Lebediev
Hi,


Let's say we have a C* cluster with following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that 2 any nodes down will cause unavailability of a keyrange at 
CL=QUORUM?


Regards,

Kyrill


RE: Reg :- Multiple Node Cluster set up in Virtual Box

2017-11-07 Thread Kyrylo Lebediev
Nandan,

There are several options available how this can be done.

For example, you may configure 2 network adapters per each VM:
1) NAT: in order the VM to have access to the Internet
2) Host-only Adapter - for internode communication setup (listen_address, 
rpc_address). Static IP configuration should be used for these interfaces.
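
If you prefer to do this from the command line rather than the GUI, something 
like this should work (a sketch, with a made-up VM name and the default 
host-only network name):

  VBoxManage modifyvm "cassandra-node1" --nic1 nat
  VBoxManage modifyvm "cassandra-node1" --nic2 hostonly --hostonlyadapter2 vboxnet0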

Regards,
Kyrill

From: @Nandan@ [mailto:nandanpriyadarshi...@gmail.com]
Sent: Tuesday, November 7, 2017 6:49 AM
To: user 
Subject: Re: Reg :- Multiple Node Cluster set up in Virtual Box

Hi All,

Thanks for sharing all the information.
I am starting to work on this.
The problem which I am getting right now is:
1) How do I select the network for the virtual machines so that I can get a 
different IP for each VirtualBox VM?
2) As I am using WiFi on the host machine, which is Windows 10, is there any 
internal configuration required, or do I need to select a specific network 
adapter in the VirtualBox VMs so that I get IP1, IP2, IP3 for node1, node2, 
node3 respectively?

Please give me some ideas.
Thanks in advance,
Nandan Priyadarshi


On Tue, Nov 7, 2017 at 8:28 AM, James Briggs 
> wrote:
Nandan: The original Datastax training classes (when it was still called 
Riptano)
used 3 virtualbox Debian instances to setup a Cassandra cluster.

Thanks, James Briggs.
--
Cassandra/MySQL DBA. Available in San Jose area or remote.
cass_top: https://github.com/jamesbriggs/cassandra-top


From: kurt greaves >
To: User >
Sent: Monday, November 6, 2017 3:08 PM
Subject: Re: Reg :- Multiple Node Cluster set up in Virtual Box

Worth keeping in mind that in 3.6 onwards nodes will not start unless they can 
contact a seed. Not quite SPOF but still problematic. 
CASSANDRA-13851