Re: [EXTERNAL] Cassandra 3.11 - below normal disk read after restart

2024-09-06 Thread Jeff Jirsa
The unfortunate reality here is I don’t think anyone is going to be able to 
answer with the data provided.

Are the disk IOPS from cassandra reads? Or compaction? Or repair? Do they ramp 
with client reads (is that curve matching your customer traffic?)? 
Are they from client data reads or from internal reads (e.g. schema and auth 
from client reconnects)? 
Are they the first reads, or read repair? 

If this were my cluster, I’d be looking at the rest of the graphs to try to 
tell what “else” was happening beyond high read IOPS. If nothing stood out, I 
would have taken a stack trace to try to see what those nodes were doing at the 
time, vs what they’re doing “normally”. 
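
A minimal way to split those possibilities apart on the affected node, using standard
nodetool and JDK tooling (the PID lookup below is illustrative):

    nodetool compactionstats        # is compaction driving the reads?
    nodetool netstats               # is repair / streaming driving the reads?
    nodetool tpstats                # ReadStage activity vs. internal stages
    nodetool tablestats <keyspace>  # local read counts and latencies per table

    # capture a stack trace "at the time" to compare against "normally"
    jstack $(pgrep -f CassandraDaemon) > /tmp/cassandra-threads.$(date +%s).txt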


> On Sep 6, 2024, at 12:29 PM, Pradeep Badiger  wrote:
> 
> Thanks, Jeff. We use QUORUM consistency for reads and writes. Even we are 
> clueless as to why such an issue could occur. Do you think restarting again 
> and running the full repair on the node would help?
>  
> From: Jeff Jirsa <jji...@gmail.com>
> Sent: Friday, September 6, 2024 2:03 PM
> To: cassandra <user@cassandra.apache.org>
> Cc: Pradeep Badiger <pradeepbadi...@fico.com>
> Subject: [EXTERNAL] Re: Cassandra 3.11 - below normal disk read after restart
>  
> 
> If they went up by 1/7th, you could potentially assume it was something related 
> to the snitch not choosing the restarted host. But they went up by a lot (2-3x?). 
> What consistency level do you use for reads and writes, and do you have 
> graphs for local reads / hint delivery? (I’m GUESSING that you’re seeing 
> extra read repair or some other multiplier kick in, but it doesn’t make a lot 
> of sense to be honest). 
>  
>  
> 
> 
> On Sep 6, 2024, at 9:47 AM, Pradeep Badiger via user 
> <user@cassandra.apache.org> wrote:
>  
> Hi,
>  
> We are using Cassandra 3.11 with a cluster of 7 nodes and replication of 6 
> with most of the default configurations. During a recent maintenance window, 
> one of the nodes was restarted. The node came back up normal, with no errors 
> of any sort. But when the application started using the cluster, we found 
> below-normal disk io read rates on the node that was restarted, and other 
> nodes in the cluster reported above-normal disk io read rates. This 
> difference became significant causing alerts to get reported by the 
> monitoring system. As a measure to resolve the issue the application was 
> stopped and the entire cluster was restarted after which all 7 nodes reported 
> almost the same read rates.
>  
> 
> Figure 1 - After the node 53 was restarted.
> 
>  
> 
> Figure 2 - After the entire cluster restart.
> 
> The node in question was not down for a very long time. Is there any specific 
> reason the read rates would differ like this? Is there a way to resolve this 
> without restarting the entire cluster?
>  
> Thanks,
> Pradeep V.B.



Re: Cassandra 3.11 - below normal disk read after restart

2024-09-06 Thread Jeff Jirsa
If they went up by 1/7th, you could potentially assume it was something related to 
the snitch not choosing the restarted host. But they went up by a lot (2-3x?). What 
consistency level do you use for reads and writes, and do you have graphs for 
local reads / hint delivery? (I’m GUESSING that you’re seeing extra read repair 
or some other multiplier kick in, but it doesn’t make a lot of sense to be 
honest). 



> On Sep 6, 2024, at 9:47 AM, Pradeep Badiger via user 
>  wrote:
> 
> Hi,
>  
> We are using Cassandra 3.11 with a cluster of 7 nodes and replication of 6 
> with most of the default configurations. During a recent maintenance window, 
> one of the nodes was restarted. The node came back up normal, with no errors 
> of any sort. But when the application started using the cluster, we found 
> below-normal disk io read rates on the node that was restarted, and other 
> nodes in the cluster reported above-normal disk io read rates. This 
> difference became significant causing alerts to get reported by the 
> monitoring system. As a measure to resolve the issue the application was 
> stopped and the entire cluster was restarted after which all 7 nodes reported 
> almost the same read rates.
>  
> 
> Figure 1 - After the node 53 was restarted.
> 
>  
> 
> Figure 2 - After the entire cluster restart.
> 
> The node in question was not down for a very long time. Is there any specific 
> reason the read rates would differ like this? Is there a way to resolve this 
> without restarting the entire cluster?
>  
> Thanks,
> Pradeep V.B.



Re: null values injected while drop compact storage was executed

2024-05-07 Thread Jeff Jirsa
This sounds a lot like CASSANDRA-13004 - a bug, since fixed, that broke data being 
read-repaired during an ALTER statement.

I suspect it’s not actually that same bug, but may be close/related. 
Reproducing it reliably would be a huge help. 

- Jeff



> On May 7, 2024, at 1:50 AM, Matthias Pfau via user 
>  wrote:
> 
> Hi there,
> we just ran drop compact storage in order to prepare for the upgrade to 
> version 4.
> 
> We observed that column values have been written as null if they were 
> inserted while the drop compact storage statement was running. This just 
> happened for the couple of seconds the drop compact storage statement ran.
> 
> Did anyone else observe this? What are the proposed strategies to prevent 
> data loss?
> 
> Best,
> Matthias


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Jeff Jirsa
It seems like if folks really want the life of a connection to be finite 
(either client/server or server/server), adding in an option to quietly drain 
and recycle a connection on some period isn’t that difficult.

That type of requirement shows up in a number of environments, usually on 
interactive logins (cqlsh, login, walk away, the connection needs to become 
invalid in a short and finite period of time), but adding it to internode could 
also be done, and may help in some weird situations (if you changed certs 
because you believe a key/cert is compromised, having the connection remain 
active is decidedly inconvenient, so maybe it does make sense to add an 
expiration timer/condition on each connection).
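
A minimal sketch of the rolling recycle Andy describes below, for client (native
protocol) connections only; the pause is arbitrary and should match how quickly your
drivers re-establish:

    # per node, one node at a time, after the new keystore/truststore is in place
    nodetool disablebinary   # stop the native transport; clients fail over to other nodes
    sleep 60                 # give drivers time to reconnect elsewhere
    nodetool enablebinary    # new connections are built with the reloaded SSL context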



> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
> 
> In addition to what Andy mentioned, I want to point out that for the vast 
> majority of use-cases, we would like to _avoid_ interruptions when a 
> certificate is updated so it is by design. If you're dealing with a situation 
> where you want to ensure that the connections are cycled, you can follow 
> Andy's advice. It will require automation outside the database that you might 
> already have. If there is demand, we can consider adding a feature to slowly 
> cycle the connections so the old SSL context is not used anymore.
> 
> One more thing you should bear in mind is that Cassandra will not load the 
> new SSL context if it cannot successfully initialize it. This is again by 
> design to prevent an outage when the updated truststore is corrupted or could 
> not be read in some way.
> 
> thanks,
> Dinesh
> 
> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy wrote:
>> I should mention, when toggling disablebinary/enablebinary between
>> instances, you will probably want to give some time between doing this
>> so connections can reestablish, and you will want to verify that the
>> connections can actually reestablish.  You also need to be mindful of
>> this being disruptive to inflight queries (if your client is
>> configured for retries it will probably be fine).  Semantically to
>> your applications it should look a lot like a rolling cluster bounce.
>> 
>> Thanks,
>> Andy
>> 
>> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>> <pabbireddyavin...@gmail.com> wrote:
>> >
>> > Thanks Andy for your reply . We will test the scenario you mentioned.
>> >
>> > Regards
>> > Avinash
>> >
>> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy wrote:
>> >>
>> >> Hi Avinash,
>> >>
>> >> As far as I understand it, if the underlying keystore/trustore(s)
>> >> Cassandra is configured for is updated, this *will not* provoke
>> >> Cassandra to interrupt existing connections, it's just that the new
>> >> stores will be used for future TLS initialization.
>> >>
>> >> Via: 
>> >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>> >>
>> >> > When the files are updated, Cassandra will reload them and use them for 
>> >> > subsequent connections
>> >>
>> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> >> only client connections) after you roll out a keystore/truststore
>> >> change as a way of enforcing the existing connections to reestablish.
>> >>
>> >> Thanks,
>> >> Andy
>> >>
>> >>
>> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>> >> <pabbireddyavin...@gmail.com> wrote:
>> >> >
>> >> > Dear Community,
>> >> >
>> >> > I hope this email finds you well. I am currently testing SSL 
>> >> > certificate hot reloading on a Cassandra cluster running version 4.1 
>> >> > and encountered a situation that requires your expertise.
>> >> >
>> >> > Here's a summary of the process and issue:
>> >> >
>> >> > Reloading Process: We reloaded certificates signed by our in-house 
>> >> > certificate authority into the cluster, which was initially running 
>> >> > with self-signed certificates. The reload was done node by node.
>> >> >
>> >> > Truststore and Keystore: The truststore and keystore passwords are the 
>> >> > same across the cluster.
>> >> >
>> >> > Unexpected Behavior: Despite the different truststore configurations 
>> >> > for the self-signed and new CA certificates, we observed no breakdown 
>> >> > in server-to-server communication during the reload. We did not upload 
>> >> > the new CA cert into the old truststore.We anticipated interruptions 
>> >> > due to the differing truststore configurations but did not encounter 
>> >> > any.
>> >> >
>> >> > Post-Reload Changes: After reloading, we updated the cqlshrc file with 
>> >> > the new CA certificate and key to connect to cqlsh.
>> >> >
>> >> > server_encryption_options:
>> >> >
>> >> > internode_encryption: all
>> >> >
>> >> > keystore: ~/conf/server-keystore.jks
>> >> >
>> >> > keystore_password: 
>> >> >
>> >> > truststore: ~/conf/server-truststore.jks
>> >> >
>> >> > truststore_password: 
>> >> >
>> >

Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jeff Jirsa
To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op. 

If you use the order below, the last nodes to decommission will cause those 
surviving machines to run out of space (assuming you have more than a few nodes 
to start)
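
A hedged sketch of that ordering (repair, drop the DC from replication, then
decommission), assuming a keyspace named ks and a datacenter DC2 being removed while
DC1 stays at RF 3 - all names are placeholders, and the 4.1 validation discussed in
the quoted mail may refuse the ALTER while nodes in DC2 are still up:

    # 1. repair while DC2 is still a replica
    nodetool repair -full ks

    # 2. remove DC2 from replication before decommissioning (repeat per keyspace)
    cqlsh -e "ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"

    # 3. decommission each node in DC2; with no ranges owned, this is close to a no-op
    nodetool decommission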



> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
> 
> You shouldn’t decom an entire DC before removing it from replication.
> 
> —
> 
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com 
> 
> 
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user 
> <user@cassandra.apache.org> wrote:
>> Hello community,
>> 
>> In our deployments, we usually rebuild the Cassandra datacenters for 
>> maintenance or recovery operations.
>> 
>> The procedure used since the days of Cassandra 3.x was the one documented in 
>> the DataStax documentation: "Decommissioning a datacenter | Apache Cassandra 3.x" 
>> (datastax.com).
>> 
>> After upgrading to Cassandra 4.1.4, we have realized that there are some 
>> stricter rules that do not allow removing the replication when active 
>> Cassandra nodes still exist in a datacenter.
>> 
>> This check makes the above-mentioned procedure obsolete.
>> 
>> I am thinking to use the following as an alternative:
>> 
>> Make sure no clients are still writing to any nodes in the datacenter.
>> Run a full repair with nodetool repair.
>> Run nodetool decommission using the --force option on every node in the 
>> datacenter being removed.
>> Change all keyspaces so they no longer reference the datacenter being 
>> removed.
>>  
>> 
>> What is the procedure followed by other users? Do you see any risk following 
>> the proposed procedure?
>> 
>>  
>> 
>> BR
>> 
>> MK
>> 



Re: Schema inconsistency in mixed-version cluster

2023-12-12 Thread Jeff Jirsa
A static collection is probably atypical, and again, I would encourage you to
open a JIRA.

This seems like a case we should be able to find in a simulator.


On Tue, Dec 12, 2023 at 2:05 PM Sebastian Marsching 
wrote:

> I assume these are column names of a non-system table.
>
> This is correct. It is one of our application tables. The table has the
> following schema:
>
> CREATE TABLE pv_archive.channels (
> channel_name text,
> decimation_level int,
> bucket_start_time bigint,
> channel_data_id uuid static,
> control_system_type text static,
> server_id uuid static,
> decimation_levels set static,
> bucket_end_time bigint,
> PRIMARY KEY (channel_name, decimation_level, bucket_start_time)
> ) WITH CLUSTERING ORDER BY (decimation_level ASC, bucket_start_time ASC)
> AND additional_write_policy = '99p'
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND cdc = false
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND memtable = 'default'
> AND crc_check_chance = 1.0
> AND default_time_to_live = 0
> AND extensions = {}
> AND gc_grace_seconds = 2592000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair = 'BLOCKING'
> AND speculative_retry = '99p';
>
> From the stack trace, this looks like an error from a node which was
> running 4.1.3, and this node was not the coordinator for this query.
>
> I did some research and found these bug reports which may be related:
>
>    - CASSANDRA-15899: Dropping a column can break queries until the schema is fully propagated
>    - CASSANDRA-16735: Adding columns via ALTER TABLE can generate corrupt sstables
>
> The solution for CASSANDRA-16735 was to revert CASSANDRA-15899, according
> to the comments in the ticket.
>
> This does look like CASSANDRA-15899 is back, but I can't see why it was
> only happening when the nodes were running mixed versions, and then stopped
> after all nodes were upgraded.
>
> I do not think that it is either of these bugs. These bugs occurred after
> altering a table, but I can say with certainty that this table has never
> been altered after it was created years ago.
>
> It must be a very strange bug where C* somehow gets confused about the
> schema for a table during an upgrade, even when the schema for this table
> did not change. I wonder whether it might have anything to do with the use
> of static columns…
>
> We have a second cluster that is using a setup that is pretty much
> identical and that we have not upgraded yet. We are now scheduling a bit of
> downtime for the upgrade there. As that cluster is rather small (only six
> nodes), upgrading the whole cluster should not take very long.
>
> It will be interesting to see whether the problem will appear there too.
> If it doesn’t, this might have been some kind of freak accident that might
> not warrant further investigation. If it happens again, I might be able to
> collect more information.
>
>


Re: Schema inconsistency in mixed-version cluster

2023-12-12 Thread Jeff Jirsa
This deserves a JIRA



On Tue, Dec 12, 2023 at 8:30 AM Sebastian Marsching 
wrote:

> Hi,
>
> while upgrading our production cluster from C* 3.11.14 to 4.1.3, we
> experienced the issue that some SELECT queries failed due to supposedly no
> replica being available. The system logs on the C* nodes where full of
> messages like the following one:
>
> ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
> java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
> at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
> at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
> at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
> at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
> at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
> at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
> at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
> at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
> at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
> at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
> at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
> at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
> at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
> at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
> at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
> at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
> at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
> at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
> at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> This problem only persisted while the cluster had a mix of 3.11.14 and
> 4.1.3 nodes. As soon as the last node was updated, the problem disappeared
> immediately, so I suspect that it was somehow caused by the unavoidable
> schema inconsistency during the upgrade.
>
> I just wanted to give everyone who hasn’t upgraded yet a heads up, so that
> they are aware that this problem might exist. Interestingly, it seems like
> not all queries involving the affected table were affected by this problem.
> As far as I am aware, no schema changes have ever been made to the affected
> table, so I am pretty certain that the schema inconsistencies were purely
> related to the upgrade process.
>
> We hadn’t noticed this problem when testing the upgrade on our test
> cluster because there we first did the upgrade and then ran the test
> workload. So, if you are worried you might be affected by this problem as
> well, you might want to run your workload on the test cluster while having
> mixed versions.
>
> I did not investigate the cause further because simply completing the
> upgrade process seemed like the quickest option to get the cluster fully
> operational again.
>
> Cheers,
> Sebastian
>
>


Re: Remove folders of deleted tables

2023-12-05 Thread Jeff Jirsa
The last time you mentioned this:

On Tue, Dec 5, 2023 at 11:57 AM Sébastien Rebecchi 
wrote:

> Hi Bowen,
>
> Thanks for your answer.
>
> I was thinking of extreme use cases, but as far as I am concerned I can
> deal with creation and deletion of 2 tables every 6 hours for a keyspace.
> So it lets around 8 folders of deleted tables per day - sometimes more
> cause I can see sometimes 2 folders created for a same table name, with 2
> different ids, caused by temporary schema disagreements I guess.
>

I told you it's much worse than you're assuming it is:
https://lists.apache.org/thread/fzkn3vqjyfjslcv97wcycb6w0wn5ltk2

Here's a more detailed explanation:
https://www.mail-archive.com/user@cassandra.apache.org/msg62206.html

(This is fixed and strictly safe in the version of cassandra with
transactional cluster metadata, which just got merged to trunk in the past
month, so "will be safe soon").
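
A hedged sketch of one way to spot directories left behind by dropped tables, assuming
default data file locations (paths are illustrative); a directory whose
<table_name>-<id> suffix no longer matches an id in system_schema.tables is a
candidate, but read the threads above before deleting anything:

    # table ids the schema currently knows about
    cqlsh -e "SELECT keyspace_name, table_name, id FROM system_schema.tables;" > /tmp/schema_tables.txt

    # on-disk table directories (named <table_name>-<id_without_dashes>)
    ls -d /var/lib/cassandra/data/*/*-* > /tmp/disk_tables.txt

    # compare the two lists by hand (or with a small script) before removing anything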


Re: Migrating to incremental repair in C* 4.x

2023-11-27 Thread Jeff Jirsa
I don’t work for DataStax, that’s not my blog, and I’m on a phone and 
potentially missing nuance, but I’d never try to convert a cluster to IR by 
disabling autocompaction. It sounds very much out of date, or it’s optimized 
for fixing one node in a cluster somehow. It didn’t make sense in the 4.0 era. 

Instead I’d leave compaction running and slowly run incremental repair across 
parts of the token range, slowing down as pending compactions increase

I’d choose token ranges such that you’d repair 5-10% of the data on each node 
at a time
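
A sketch of what that can look like per node, assuming a keyspace named ks and
start/end tokens chosen to cover roughly 5-10% of the node's data (token values are
placeholders, and the exact flag combinations accepted depend on your 4.x version):

    # incremental repair of one token subrange (incremental is the default in 4.x)
    nodetool repair -st -9223372036854775808 -et -7378697629483820647 ks

    # watch pending compactions before moving on to the next subrange
    nodetool compactionstats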



> On Nov 23, 2023, at 11:31 PM, Sebastian Marsching  
> wrote:
> 
> Hi,
> 
> we are currently in the process of migrating from C* 3.11 to C* 4.1 and we 
> want to start using incremental repairs after the upgrade has been completed. 
> It seems like all the really bad bugs that made using incremental repairs 
> dangerous in C* 3.x have been fixed in 4.x, and for our specific workload, 
> incremental repairs should offer a significant performance improvement.
> 
> Therefore, I am currently devising a plan how we could migrate to using 
> incremental repairs. I am aware of the guide from DataStax 
> (https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html),
>  but this guide is quite old and was written with C* 3.0 in mind, so I am not 
> sure whether this still fully applies to C* 4.x.
> 
> In addition to that, I am not sure whether this approach fits our workload. 
> In particular, I am wary about disabling autocompaction for an extended 
> period of time (if you are interested in the reasons why, they are at the end 
> of this e-mail).
> 
> Therefore, I am wondering whether a slighly different process might work 
> better for us:
> 
> 1. Run a full repair (we periodically run those anyway).
> 2. Mark all SSTables as repaired, even though they will include data that has 
> not been repaired yet because it was added while the repair process was 
> running.
> 3. Run another full repair.
> 4. Start using incremental repairs (and the occasional full repair in order 
> to handle bit rot etc.).
> 
> If I understood the interactions between full repairs and incremental repairs 
> correctly, step 3 should repair potential inconsistencies in the SSTables 
> that were marked as repaired in step 2 while avoiding the problem of 
> overstreaming that would happen when only marking those SSTables as repaired 
> that already existed before step 1.
> 
> Does anyone see a flaw in this concept or has experience with a similar 
> scenario (migrating to incremental repairs in an environment with 
> high-density nodes, where a single table contains most of the data)?
> 
> I am also interested in hearing about potential problems other C* users 
> experienced when migrating to incremental repairs, so that we get a better 
> idea what to expect.
> 
> Thanks,
> Sebastian
> 
> 
> Here is the explanation why I am being cautious:
> 
> More than 95 percent of our data is stored in a single table, and we use high 
> density nodes (storing about 3 TB of data per node). This means that a full 
> repair for the whole cluster takes about a week.
> 
> The reason for this layout is that most of our data is “cold”, meaning that 
> it is written once, never updated, and rarely deleted or read. However, new 
> data is added continuously, so disabling autocompaction for the duration of a 
> full repair would lead to a high number of small SSTables accumulating over 
> the course of the week, and I am not sure how well the cluster would handle 
> such a situation (and the increased load when autocompaction is enabled 
> again).


Re: Upgrade from C* 3 to C* 4 per datacenter

2023-10-26 Thread Jeff Jirsa


> On Oct 26, 2023, at 12:32 AM, Michalis Kotsiouros (EXT) via user 
>  wrote:
> 
> 
> Hello Cassandra community,
> We are trying to upgrade our systems from Cassandra 3 to Cassandra 4. We plan 
> to do this per data center.
> During the upgrade, a cluster with mixed SW levels is expected. At this point 
> is it possible to perform topology changes?

You may find that this works at various stages when an entire DC is upgraded or 
not upgraded, if you don’t do any cross-DC streaming (e.g. no SimpleStrategy 
keyspaces at all). It’s not guaranteed to work - we don’t test it - but I expect 
that it probably will.

Schema changes will not. 

> In case of an upgrade failure, would it be possible to remove the data center 
> from the cluster, restore the datacenter to C*3 SW and add it back to cluster 
> which will contain datacenters in both C* 3 and C*4?

Definitely possible for the first DC you upgrade
Untested for the second-last 

> Alternatively, could we remove the datacenter, perform the SW upgrade to C*4 
> and then add it back to the cluster?

Not really. Probably technically possible but doesn’t make a lot of practical 
sense

> Are there any suggestions or experiences regarding this fallback scenario?

Doing one host, then one replica of each replica set (1/3rd of hosts / 1 AZ), 
then one DC, then repeat for all DCs. The point of no return, so to speak, is 
when you get into the second AZ of the second DC. Until that point you can just 
act as if the upgraded hosts failed all at once and re-stream that data via 
bootstrap 

Not clear what exactly worries you in the upgrade, but restore a backup to a 
lab and run the upgrade once or twice offline. Doesn’t have to be a full size 
cluster, just a few hosts in a few fake DCs. The 3-4 upgrade was pretty 
uneventful compared to past upgrades, especially if you use the later releases. 
Good to be cautious, though. 
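
For the per-host step in that order, a rough sketch of the usual sequence (service
name and package step depend on your install method; treat this as illustrative):

    nodetool drain               # flush memtables and stop accepting traffic
    sudo systemctl stop cassandra
    # ... install the 4.x package / binaries, merge config changes ...
    sudo systemctl start cassandra
    nodetool upgradesstables     # rewrites old-format sstables; can also be left to normal compaction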


Re: Resources to understand rebalancing

2023-10-25 Thread Jeff Jirsa
Data ownership is defined by the token ring concept.

Hosts in the cluster may have tokens - let's oversimplify to 5 hosts, each
with 1 token A=0, B=1000, C=2000, D=3000, E=4000

The partition key is hashed to calculate the token, and the next 3 hosts in
the ring are the "owners" of that data - a key that hashes to 1234 would be
found on hosts C, D, E

Anytime hosts move tokens (joining/expansion, leaving/shrink,
re-arranging/moves), the tokens go into a pending state.

So if you were to add a 6th host here, let's say F=2500, when it first
gossips, it'll have 2500 in a different JOINING (pending) state. In that
state, it won't get any read traffic, but the quorum calculations will be
augmented to send extra writes - instead of needing 2/3 nodes to ack any
write, it'll require a total of 3 acks (of the 4 possible replicas, the 3
natural replicas and the 1 pending replica).

When the node finishes joining, and it gossips its state=NORMAL, it'll be
removed from pending, and the reads will move to it instead.

The gossip state transition from pending to normal isn't exact, it's
propagated via gossip (so it's seconds of change where reads/writes can hit
either replica), but the increase in writes (writing to both destinations)
should make it safe in that transition. It's being rewritten to be
transactional in an upcoming version of cassandra.
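
If you want to see this on a live cluster, a few read-only commands show it directly
(keyspace, table, and key names below are placeholders):

    nodetool ring                      # token-to-host assignments and node states
    nodetool describering ks           # token ranges and their replica endpoints for keyspace ks
    nodetool getendpoints ks t mykey   # which hosts own partition key "mykey" in table ks.t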



On Tue, Oct 24, 2023 at 11:39 PM Vikas Kumar  wrote:

> Hi folks,
>
> I am looking for some resources to understand the internals of rebalancing
> in Cassandra. Specifically:
>
>  - How are read and write queries served during data migration?
>  - How is the cutover from the current node to the new node performed?
>
> Any help is greatly appreciated.
>
> Thanks,
> Vikas
>


Re: Cassandra 4.0.6 token mismatch issue in production environment

2023-10-23 Thread Jeff Jirsa
Not aware of any that survive node restart, though in the past, there were
races around starting an expansion while one node was partitioned/down (and
missing the initial gossip / UP). A heap dump could have told us a bit more
conclusively, but it's hard to guess for now.



On Mon, Oct 23, 2023 at 3:22 PM Jaydeep Chovatia 
wrote:

> The issue was persisting on a few nodes despite no changes to the
> topology. Even node restarting did not help. Only after we evacuated those
> nodes, the issue got resolved.
>
> Do you think of a possible situation under which this could happen?
>
> Jaydeep
>
> On Sat, Oct 21, 2023 at 10:25 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Thanks, Jeff!
>> I will keep this thread updated on our findings.
>>
>> Jaydeep
>>
>> On Sat, Oct 21, 2023 at 9:37 AM Jeff Jirsa  wrote:
>>
>>> That code path was added to protect against invalid gossip states
>>>
>>> For this logger to be issued, the coordinator receiving the query must
>>> identify a set of replicas holding the data to serve the read, and one of
>>> the selected replicas must disagree that it’s a replica based on its view
>>> of the token ring
>>>
>>> This probably means that at least one node in your cluster has an
>>> invalid view of the ring - if you issue a “nodetool ring” from every host
>>> and compare them, you’ll probably notice one or more is wrong
>>>
>>> It’s also possible this happens for a few seconds during adding / moving
>>> / removing hosts
>>>
>>> If you weren’t changing the topology of the cluster, it’s  likely the
>>> case that bouncing the cluster fixes it
>>>
>>> (I’m unsure of the defaults and not able to look it up, but cassandra can
>>> log or log and drop the read - you probably want to drop the read log,
>>> which is the right solution so it doesn’t accidentally return a missing /
>>> empty result set as a valid query result, instead it’ll force it to read
>>> from other replicas or time out)
>>>
>>>
>>>
>>>
>>>
>>> On Oct 20, 2023, at 10:57 PM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>> 
>>>
>>> Hi,
>>>
>>> I am using Cassandra 4.0.6 in production, and receiving the following 
>>> error. This indicates that Cassandra nodes have a mismatch in token-ownership.
>>>
>>> Has anyone seen this issue before?
>>>
>>> Received a read request from /XX.XX.XXX.XXX:Y for a range that is not 
>>> owned by the current replica Read(keyspace.table columns=*/[c1] rowFilter= 
>>> limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0 
>>> filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
>>>
>>> Jaydeep
>>>
>>>


Re: Cassandra 4.0.6 token mismatch issue in production environment

2023-10-21 Thread Jeff Jirsa
That code path was added to protect against invalid gossip states

For this logger to be issued, the coordinator receiving the query must identify 
a set of replicas holding the data to serve the read, and one of the selected 
replicas must disagree that it’s a replica based on its view of the token ring

This probably means that at least one node in your cluster has an invalid view 
of the ring - if you issue a “nodetool ring” from every host and compare them, 
you’ll probably notice one or more is wrong 
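
A quick sketch of that comparison, assuming nodetool/JMX access to every node from one
machine and a $HOSTS variable holding the node addresses (illustrative only):

    for h in $HOSTS; do nodetool -h "$h" ring > "ring.$h.txt"; done
    # diff any two outputs; a host whose token list differs has a divergent view of the ring
    diff ring.10.0.0.1.txt ring.10.0.0.2.txt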

It’s also possible this happens for a few seconds during adding / moving / 
removing hosts 

If you weren’t changing the topology of the cluster, it’s  likely the case that 
bouncing the cluster fixes it 

(I’m unsure of the defaults and not able to look it up, but cassandra can log or 
log and drop the read - you probably want to drop the read log, which is the 
right solution so it doesn’t accidentally return a missing / empty result set 
as a valid query result, instead it’ll force it to read from other replicas or 
time out)





> On Oct 20, 2023, at 10:57 PM, Jaydeep Chovatia  
> wrote:
> 
> 
> Hi,
> I am using Cassandra 4.0.6 in production, and receiving the following error. 
> This indicates that Cassandra nodes have a mismatch in token-ownership.
> Has anyone seen this issue before?
> Received a read request from /XX.XX.XXX.XXX:Y for a range that is not 
> owned by the current replica Read(keyspace.table columns=*/[c1] rowFilter= 
> limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0 
> filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
> Jaydeep


Re: java driver with cassandra proxies (option: -Dcassandra.join_ring=false)

2023-10-12 Thread Jeff Jirsa
Just to be clear:

- How many of the proxy nodes are you providing as contact points? One of
them or all of them?

It sounds like you're saying you're passing all of them, and only one is
connecting, and the driver is declining to connect to the rest because
they're not in system.peers. I'm not surprised that the proxies aren't in
system.peers, but I'd have also expected that if you pass all proxies in
contact points, it'd connect to all of them, so I think you're
appropriately surprised here.



On Thu, Oct 12, 2023 at 5:09 AM Regis Le Bretonnic <
r.lebreton...@meetic-corp.com> wrote:

> We have tested Stargate and were very disappointed...
>
> Originally our architecture was PHP microservices (with FPM) + cassandra
> proxies.
> But we were blocked because the PHP driver is no longer supported.
>
> We made tests to keep PHP + stargate but there were many issues, the main
> one (but not the only one) being stargate does not support "ALLOW
> FILTERING" clause. I don't want to re-open this debate I already had with
> Stargate maintainers...
>
> We finally decided to move from PHP to Java but we'd like to keep
> cassandra proxies that are very useful.
>
> Regards
>
> On Thu, Oct 12, 2023 at 12:05, Erick Ramirez wrote:
>
>> Those nodes are not in the peers table(s) because you told them NOT to
>> join the ring with `join_ring=false` so it is working by design.
>>
>> I'm not really sure what you're trying to achieve but if you want to
>> separate the coordinator functions from the storage then what you probably
>> want is to deploy Stargate nodes. Stargate is a
>> data API gateway that sits between the app instances and the Cassandra
>> database. It decouples client request coordination from the storage aspects
>> of C*. It also allows you to perform CRUD operations against C* using APIs
>> -- REST, JSON, gRPC, GraphQL.
>>
>> See the docs on Using the Stargate CQL API for details
>> on how to set up Stargate nodes as coordinators for your C* database.
>>
>> If you want to see it in action, you can try it free on Astra DB
>> (Cassandra-as-a-service). Cheers!
>>
>>>


Re: Startup errors - 4.1.3

2023-08-30 Thread Jeff Jirsa
There are at least two bugs in the compaction lifecycle transaction log -
one that can drop an ABORT / ADD in the wrong order (and prevent startup),
and one that allows for invalid timestamps in the log file (and again,
prevent startups).

 I believe it's safe to work around the former by removing the .log file,
and you can work around the latter by using `touch` to update the
timestamps of the data file that mismatches, but I can't find the relevant
JIRAs to be 100% sure.
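
A heavily hedged sketch of those two workarounds, with placeholder paths modeled on
the log format quoted below; stop the node and back up the files first, and treat
this as a last resort rather than a standard procedure:

    # out-of-order ABORT/ADD case: locate the offending transaction log replicas
    find /data/*/cassandra/data/<keyspace>/<table>-<id>/ -name 'nb_txn_*.log'
    # (copy them somewhere safe, then remove the replicas of that one transaction log)

    # mismatched-timestamp case: touch the sstable components named in the error
    touch /data/4/cassandra/data/<keyspace>/<table>-<id>/nb-<generation>-big-*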

(Also, it may be a good trigger to cut a new release, because things that
block startup are obviously quite serious).




On Wed, Aug 30, 2023 at 6:59 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi all - I replaced a node in a 14 node cluster, and it rebuilt OK.  I
> started to see a lot of timeout errors, and discovered one of the nodes
> had this message constantly repeated:
> "waiting to acquire a permit to begin streaming" - so perhaps I hit this
> bug:
> https://www.mail-archive.com/commits@cassandra.apache.org/msg284709.html
>
> I then restarted that node, but it gave a bunch of errors about
> "unexpected disk state: failed to read translation log"
> I deleted the corresponding files and got that node to come up, but now
> when I restart any of the other nodes in the cluster, they too do not
> start back up:
>
> Example:
>
> INFO  [main] 2023-08-30 09:50:46,130 LogTransaction.java:544 - Verifying
> logfile transaction
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,154 LogReplicaSet.java:145 - Mismatched
> line in file nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log: got
> 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]'
>
> expected
> 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]',
>
> giving up
> ERROR [main] 2023-08-30 09:50:46,155 LogFile.java:164 - Failed to read
> records for transaction log
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,156 LogTransaction.java:559 -
> Unexpected disk state: failed to read transaction log
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> Files and contents follow:
>
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]
>  ABORT:[,0,0][737437348]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
>  ***Does not match
> 
>
> in first replica file
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
>
> ERROR [main] 2023-08-30 09:50:46,156 CassandraDaemon.java:897 - Cannot
> remove temporary or obsoleted files for doc.extractedmetadata due to a
> problem with transaction log files. Please check records with problems
> in the log messages above and fix them. Refer to the 3.0 upgrading
> instructions in NEWS.txt for a description of transaction log files.
>
> I then delete the files and eventually after many iterations, the node
> comes back up.
> The table 'extractedmetadata' has 29 billion records.  Just a data point
> here - I think the

Re: Big Data Question

2023-08-21 Thread Jeff Jirsa
(Yes, just somewhat less likely to be the same order of speed-up in STCS
where sstables are more likely to cross token boundaries, modulo some stuff
around sstable splitting at token ranges a la 6696)

On Mon, Aug 21, 2023 at 11:35 AM Dinesh Joshi  wrote:

> Minor correction, zero copy streaming aka faster streaming also works for
> STCS.
>
> Dinesh
>
> On Aug 21, 2023, at 8:01 AM, Jeff Jirsa  wrote:
>
> 
> There's a lot of questionable advice scattered in this thread. Set aside
> most of the guidance like 2TB/node, it's old and super nuanced.
>
> If you're bare metal, do what your organization is good at. If you have
> millions of dollars in SAN equipment and you know how SANs work and fail
> and get backed up, run on a SAN if your organization knows how to properly
> operate a SAN. Just make sure you understand it's a single point of failure.
>
> If you're in the cloud, EBS is basically the same concept. You can lose
> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
> Have backups. Know how to restore them.
>
> The reason the "2TB/node" limit was a thing was around time to recover
> from failure more than anything else. I described this in detail here, in
> 2015, before faster-streaming in 4.0 was a thing :
> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
> . With faster streaming, IF you use LCS (so faster streaming works), you
> can probably go at least 4-5x more dense than before, if you understand how
> likely your disks are to fail and you can ensure you dont have correlated
> failures when they age out (that means if you're on bare metal, measuring
> flash life, and ideally mixing vendors to avoid firmware bugs).
>
> You'll still see risks of huge clusters, largely in gossip and schema
> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
> especially) than 3.0 was, but for "max nodes in a cluster", what you're
> really comparing is "how many gossip speakers and tokens are in the
> cluster" (which means your vnode settings matter, for things like pending
> range calculators).
>
> Looking at the roadmap, your real question comes down to :
> - If you expect to use the transactional features in Accord/5.0 to
> transact across rows/keys, you probably want to keep one cluster
> - If you dont ever expect to use multi-key transactions, just de-risk by
> sharding your cluster into many smaller clusters now, with consistent
> hashing to map keys to clusters, and have 4 clusters of the same smaller
> size, with whatever node density you think you can do based on your
> compaction strategy and streaming rate (and disk type).
>
> If you have time and budget, create a 3 node cluster with whatever disks
> you have, fill them, start working on them - expand to 4, treat one as
> failed and replace it - simulate the operations you'll do at that size.
> It's expensive to mimic a 500 host cluster, but if you've got budget, try
> it in AWS and see what happens when you apply your real schema, and then do
> a schema change.
>
>
>
>
>
> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> For our scenario, the goal is to minimize down-time for a single (at
>> least initially) data center system.  Data-loss is basically unacceptable.
>> I wouldn't say we have a "rusty slow data center" - we can certainly use
>> SSDs and have servers connected via 10G copper to a fast back-plane.  For
>> our specific use case with Cassandra (lots of writes, small number of
>> reads), the network load is usually pretty low.  I suspect that would
>> change if we used Kubernetes + central persistent storage.
>> Good discussion.
>>
>> -Joe
>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>
>> I started to respond, then realized I and the other OP posters are not
>> thinking the same: What is the business case for availability, data
>> los/reload/recoverability? You all argue for higher availability and damn
>> the cost. But noone asked "can you lose access, for 20 minutes, to a
>> portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
>> is not lost"? Can you lose access 1-2 times a year for the cost of a 500
>> node cluster holding the same data?
>>
>> Then we can discuss 32/64g JVM and SSD's.
>> *.*
>> *Arthur C. Clarke famously said that "technology sufficiently advanced is
>> indistinguishable from magic." Magic is coming, and it's coming for all of
>> us*
>>
>> *Daemeon R

Re: Big Data Question

2023-08-21 Thread Jeff Jirsa
There's a lot of questionable advice scattered in this thread. Set aside
most of the guidance like 2TB/node, it's old and super nuanced.

If you're bare metal, do what your organization is good at. If you have
millions of dollars in SAN equipment and you know how SANs work and fail
and get backed up, run on a SAN if your organization knows how to properly
operate a SAN. Just make sure you understand it's a single point of failure.

If you're in the cloud, EBS is basically the same concept. You can lose EBS
in an AZ, just like you can lose SAN in a DC. Persist outside of that. Have
backups. Know how to restore them.

The reason the "2TB/node" limit was a thing was around time to recover from
failure more than anything else. I described this in detail here, in 2015,
before faster-streaming in 4.0 was a thing :
https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
. With faster streaming, IF you use LCS (so faster streaming works), you
can probably go at least 4-5x more dense than before, if you understand how
likely your disks are to fail and you can ensure you dont have correlated
failures when they age out (that means if you're on bare metal, measuring
flash life, and ideally mixing vendors to avoid firmware bugs).

You'll still see risks of huge clusters, largely in gossip and schema
propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
especially) than 3.0 was, but for "max nodes in a cluster", what you're
really comparing is "how many gossip speakers and tokens are in the
cluster" (which means your vnode settings matter, for things like pending
range calculators).

Looking at the roadmap, your real question comes down to :
- If you expect to use the transactional features in Accord/5.0 to transact
across rows/keys, you probably want to keep one cluster
- If you dont ever expect to use multi-key transactions, just de-risk by
sharding your cluster into many smaller clusters now, with consistent
hashing to map keys to clusters, and have 4 clusters of the same smaller
size, with whatever node density you think you can do based on your
compaction strategy and streaming rate (and disk type).

If you have time and budget, create a 3 node cluster with whatever disks
you have, fill them, start working on them - expand to 4, treat one as
failed and replace it - simulate the operations you'll do at that size.
It's expensive to mimic a 500 host cluster, but if you've got budget, try
it in AWS and see what happens when you apply your real schema, and then do
a schema change.





On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> For our scenario, the goal is to minimize down-time for a single (at least
> initially) data center system.  Data-loss is basically unacceptable.  I
> wouldn't say we have a "rusty slow data center" - we can certainly use SSDs
> and have servers connected via 10G copper to a fast back-plane.  For our
> specific use case with Cassandra (lots of writes, small number of reads),
> the network load is usually pretty low.  I suspect that would change if we
> used Kubernetes + central persistent storage.
> Good discussion.
>
> -Joe
> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>
> I started to respond, then realized I and the other OP posters are not
> thinking the same: What is the business case for availability, data
> los/reload/recoverability? You all argue for higher availability and damn
> the cost. But noone asked "can you lose access, for 20 minutes, to a
> portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
> is not lost"? Can you lose access 1-2 times a year for the cost of a 500
> node cluster holding the same data?
>
> Then we can discuss 32/64g JVM and SSD's.
> *.*
> *Arthur C. Clarke famously said that "technology sufficiently advanced is
> indistinguishable from magic." Magic is coming, and it's coming for all of
> us*
>
> *Daemeon Reiydelle*
> *email: daeme...@gmail.com *
> *LI: https://www.linkedin.com/in/daemeonreiydelle/
> *
> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>
>
> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Was assuming reaper did incremental?  That was probably a bad assumption.
>>
>> nodetool repair -pr
>> I know it well now!
>>
>> :)
>>
>> -Joe
>>
>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>> > I don't have experience with Cassandra on Kubernetes, so I can't
>> > comment on that.
>> >
>> > For repairs, may I interest you with incremental repairs? It will make
>> > repairs hell of a lot faster. Of course, occasional full repair is
>> > still needed, but that's another story.
>> >
>> >
>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>> >> Thank you.  Enjoying this conversation.
>> >> Agree on blade servers, where each blade has a small number of SSDs.
>> >> Yeh/Nah to a kubernetes app

Re: Big Data Question

2023-08-16 Thread Jeff Jirsa
A lot of things depend on actual cluster config - compaction settings (LCS
vs STCS vs TWCS) and token allocation (single token, vnodes, etc) matter a
ton.

With 4.0 and LCS, streaming for replacement is MUCH faster, so much so that
most people should be fine with 4-8TB/node, because the rebuild time is
decreased by an order of magnitude.

If you happen to have large physical machines, running multiple instances
on a machine (each with a single token, and making sure you match rack
awareness) sorta approximates vnodes without some of the unpleasant side
effects.

If you happen to run on more-reliable-storage (like EBS, or a SAN, and you
understand what that means from a business continuity perspective), then
you can assume that your rebuild frequency is probably an order of
magnitude less often, so you can adjust your risk calculation based on
measured reliability there (again, EBS and other disaggregated disks still
fail, just less often than single physical flash devices).

Seed nodes never really need to change significantly. You should be fine
with 2-3 per DC no matter the instance count.




On Wed, Aug 16, 2023 at 8:34 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> General question on how to configure Cassandra.  Say I have 1PByte of
> data to store.  The general rule of thumb is that each node (or at least
> instance of Cassandra) shouldn't handle more than 2TBytes of disk.  That
> means 500 instances of Cassandra.
>
> Assuming you have very fast persistent storage (such as a NetApp,
> PorterWorx etc.), would using Kubernetes or some orchestration layer to
> handle those nodes be a viable approach?  Perhaps the worker nodes would
> have enough RAM to run 4 instances (pods) of Cassandra, you would need
> 125 servers.
> Another approach is to build your servers with 5 (or more) SSD devices -
> one for OS, four for each instance of Cassandra running on that server.
> Then build some scripts/ansible/puppet that would manage Cassandra
> start/stops, and other maintenance items.
>
> Where I think this runs into problems is with repairs, or sstablescrubs
> that can take days to run on a single instance.  How is that handled 'in
> the real world'?  With seed nodes, how many would you have in such a
> configuration?
> Thanks for any thoughts!
>
> -Joe
>
>
> --
>


Re: Cassandra p95 latencies

2023-08-11 Thread Jeff Jirsa
You’re going to have to help us help you - 4.0 is pretty widely deployed and I’m not 
aware of a perf regression. Can you give us a schema (anonymized) and queries, and 
show us a trace?

> On Aug 10, 2023, at 10:18 PM, Shaurya Gupta wrote:
> 
> The queries are rightly designed as I already explained. 40 ms is way too high as 
> compared to what I seen with other DBs and many a times with Cassandra 3.x versions.
> CPU consumed as I mentioned is not high, it is around 20%.
> 
> On Thu, Aug 10, 2023 at 5:14 PM MyWorld wrote:
>> Hi,
>> P95 should not be a problem if rightly designed. Levelled compaction strategy 
>> further reduces this, however it consume some resources. For read, caching is 
>> also helpful. Can you check your cpu iowait as it could be the reason for delay.
>> Regards,
>> Ashish
>> 
>> On Fri, 11 Aug, 2023, 04:58 Shaurya Gupta, wrote:
>>> Hi community
>>> What is the expected P95 latency for Cassandra Read and Write queries executed 
>>> with Local_Quorum over a table with 3 replicas? The queries are done using the 
>>> partition + clustering key and row size in bytes is not too much, maybe 1-2 KB 
>>> maximum. Assuming CPU is not a crunch?
>>> We observe those to be 40 ms P95 Reads and same for Writes. This looks very high 
>>> as compared to what we expected. We are using Cassandra 4.0.
>>> Any documentation / numbers will be helpful.
>>> Thanks
>>> -- 
>>> Shaurya Gupta
> 
> -- 
> Shaurya Gupta
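
For the trace requested above, the simplest capture is from cqlsh (keyspace, table,
and key are placeholders - run the real query):

    cqlsh> CONSISTENCY LOCAL_QUORUM;
    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM ks.t WHERE pk = 'example-key';
    -- cqlsh prints the per-replica trace events and elapsed times after the rows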


Re: write on ONE node vs replication factor

2023-07-15 Thread Jeff Jirsa
Consistency level controls when queries acknowledge/succeed

Replication factor is where data lives / how many copies 

If you write at consistency ONE and replication factor 3, the query finishes 
successfully when the write is durable on one of the 3 copies.

It will get sent to all 3, but it’ll return when it’s durable on one.

If you write at ONE and it goes to the first replica, and you read at ONE and 
it reads from the last replica, it may return without the data:  you may not 
see a given write right away. 
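
A small CQL illustration of the two knobs (keyspace, table, and DC names are
placeholders):

    -- replication factor: how many copies exist (3 copies in DC1)
    CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
    CREATE TABLE ks.t (k text PRIMARY KEY, v text);

    -- consistency level: how many copies must acknowledge, set per request (here via cqlsh)
    CONSISTENCY ONE;
    INSERT INTO ks.t (k, v) VALUES ('key1', 'value1');  -- returns once 1 of the 3 replicas acks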

> On Jul 15, 2023, at 7:05 PM, Anurag Bisht  wrote:
> 
> 
> Hello Users,
> 
> I am new to Cassandra and trying to understand the architecture of it. If I 
> write to ONE node for a particular key and have a replication factor of 3, 
> would the written key get replicated to the other two nodes? Let me 
> know if I am thinking incorrectly.
> 
> Thanks,
> Anurag


Re: Replacing node without shutting down the old node

2023-05-16 Thread Jeff Jirsa
In-line

> On May 15, 2023, at 5:26 PM, Runtian Liu wrote:
> 
> Hi Jeff,
> I tried the setup with vnode 16 and NetworkTopologyStrategy replication strategy 
> with replication factor 3 with 3 racks in one cluster. When using the new node 
> token as the old node token - 1

I had said +1 but you’re right that it’s actually -1, sorry about that. You want the 
new node to be lower than the existing host. The lower token will take most of the 
data.

> I see the new node is streaming from the old node only. And the decom phase of the 
> old node is extremely fast. Does this mean the new node will only take data 
> ownership from the old node?

With exactly three racks, yes. With more racks or fewer racks, no.

> I also did some cleanups after replacing node with old token - 1 and the cleanup 
> sstable count was not increasing. Looks like adding a node with old_token - 1 and 
> decom the old node will not generate stale data on the rest of the cluster. Do you 
> know if there are any edge cases that in this replacement process can generate any 
> stale data on other nodes of the cluster with the setup I mentioned?

Should do exactly what you want. I’d still run cleanup but it should be a no-op.

> Thanks,
> Runtian
> 
> On Mon, May 8, 2023 at 9:59 PM Runtian Liu <curly...@gmail.com> wrote:
>> I thought the joining node would not participate in quorum? How are we counting 
>> things like how many replicas ACK a write when we are adding a new node for 
>> expansion? The token ownership won't change until the new node is fully joined 
>> right?
>> 
>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>> You can't have two nodes with the same token (in the current metadata 
>>> implementation) - it causes problems counting things like how many replicas ACK 
>>> a write, and what happens if the one you're replacing ACKs a write but the 
>>> joining host doesn't? It's harder than it seems to maintain consistency 
>>> guarantees in that model, because you have 2 nodes where either may end up 
>>> becoming the sole true owner of the token, and you have to handle both cases 
>>> where one of them fails.
>>> 
>>> An easier option is to add it with new token set to old token +1 (as an 
>>> expansion), then decom the leaving node (shrink). That'll minimize streaming 
>>> when you decommission that node.
>>> 
>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu <curly...@gmail.com> wrote:
>>>> Hi all,
>>>> Sometimes we want to replace a node for various reasons, we can replace a node 
>>>> by shutting down the old node and letting the new node stream data from other 
>>>> replicas, but this approach may have availability issues or data consistency 
>>>> issues if one more node in the same cluster went down. Why Cassandra doesn't 
>>>> support replacing a node without shutting down the old one? Can we treat the 
>>>> new node as normal node addition while having exactly the same token ranges as 
>>>> the node to be replaced. After the new node's joining process is complete, we 
>>>> just need to cut off the old node. With this, we don't lose any availability 
>>>> and the token range is not moved so no clean up is needed. Is there any 
>>>> downside of doing this?
>>>> Thanks,
>>>> Runtian





Re: Replacing node without shutting down the old node

2023-05-08 Thread Jeff Jirsa
You can't have two nodes with the same token (in the current metadata
implementation) - it causes problems counting things like how many replicas
ACK a write, and what happens if the one you're replacing ACKs a write but
the joining host doesn't? It's harder than it seems to maintain consistency
guarantees in that model, because you have 2 nodes where either may end up
becoming the sole true owner of the token, and you have to handle both
cases where one of them fails.

An easier option is to add it with new token set to old token +1 (as an
expansion), then decom the leaving node (shrink). That'll minimize
streaming when you decommission that node.
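For reference, a rough sketch of that add-then-shrink flow, assuming a single
token per node for clarity (with vnodes, initial_token takes a comma-separated
list and the same offset idea applies per token); note the 2023-05-16
follow-up above corrects the offset to old_token - 1:

# on the NEW node, in cassandra.yaml before its first start:
#   initial_token: <old_token - 1>
#   auto_bootstrap: true
# then start it and wait for it to show UN in `nodetool status`

# on the OLD node, once the new node has finished joining:
nodetool decommission

# on the remaining nodes (expected to be close to a no-op per this thread):
nodetool cleanup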



On Mon, May 8, 2023 at 7:19 PM Runtian Liu  wrote:

> Hi all,
>
> Sometimes we want to replace a node for various reasons, we can replace a
> node by shutting down the old node and letting the new node stream data
> from other replicas, but this approach may have availability issues or data
> consistency issues if one more node in the same cluster went down. Why
> Cassandra doesn't support replacing a node without shutting down the old
> one? Can we treat the new node as normal node addition while having exactly
> the same token ranges as the node to be replaced. After the new node's
> joining process is complete, we just need to cut off the old node. With
> this, we don't lose any availability and the token range is not moved so no
> clean up is needed. Is there any downside of doing this?
>
> Thanks,
> Runtian
>


Re: Is cleanup is required if cluster topology changes

2023-05-05 Thread Jeff Jirsa
Lots of caveats on these suggestions, let me try to hit most of them.

Cleanup in parallel is good and fine and common. Limit number of threads in
cleanup if you're using lots of vnodes, so each node runs one at a time and
not all nodes use all your cores at the same time.
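As a sketch, the -j/--jobs flag is the usual way to cap that per node (the
keyspace name is a placeholder):

# clean one sstable at a time on this node instead of using every compaction thread
nodetool cleanup -j 1 my_keyspace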
If a host is fully offline, you can ALSO use replace address first boot.
It'll stream data right to that host with the same token assignments you
had before, and no cleanup is needed then. Strictly speaking, to avoid
resurrection here, you'd want to run repair on the replicas of the down
host (for vnodes, probably the whole cluster), but your current process
doesnt guarantee that either (decom + bootstrap may resurrect, strictly
speaking).
Dropping vnodes will reduce the replicas that have to be cleaned up, but
also potentially increase your imbalance on each replacement.

Cassandra should still do this on its own, and I think once CEP-21 is
committed, this should be one of the first enhancement tickets.

Until then, LeveledCompactionStrategy really does make cleanup fast and
cheap, at the cost of higher IO the rest of the time. If you can tolerate
that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
data deletion than STCS). It's a lot of IO compared to STCS though.
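If you do go that route, switching a table to LCS is a one-line schema change;
the keyspace, table and sstable size below are illustrative only:

ALTER TABLE my_ks.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};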



On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia 
wrote:

> Thanks all for your valuable inputs. We will try some of the suggested
> methods in this thread, and see how it goes. We will keep you updated on
> our progress.
> Thanks a lot once again!
>
> Jaydeep
>
> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Depending on the number of vnodes per server, the probability and
>> severity (i.e. the size of the affected token ranges) of an availability
>> degradation due to a server failure during node replacement may be small.
>> You also have the choice of increasing the RF if that's still not
>> acceptable.
>>
>> Also, reducing number of vnodes per server can limit the number of
>> servers affected by replacing a single server, therefore reducing the
>> amount of time required to run "nodetool cleanup" if it is run sequentially.
>>
>> Finally, you may choose to run "nodetool cleanup" concurrently on
>> multiple nodes to reduce the amount of time required to complete it.
>>
>>
>> On 05/05/2023 16:26, Runtian Liu wrote:
>>
>> We are doing the "adding a node then decommissioning a node" to
>> achieve better availability. Replacing a node need to shut down one node
>> first, if another node is down during the node replacement period, we will
>> get availability drop because most of our use case is local_quorum with
>> replication factor 3.
>>
>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>>> This will not result in a topology change, which means "nodetool cleanup"
>>> is not needed after the operation is completed.
>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>
>>> Thanks, Jeff!
>>> But in our environment we replace nodes quite often for various
>>> optimization purposes, etc. say, almost 1 node per day (node *addition*
>>> followed by node *decommission*, which of course changes the topology),
>>> and we have a cluster of size 100 nodes with 300GB per node. If we have to
>>> run cleanup on 100 nodes after every replacement, then it could take
>>> forever.
>>> What is the recommendation until we get this fixed in Cassandra itself
>>> as part of compaction (w/o externally triggering *cleanup*)?
>>>
>>> Jaydeep
>>>
>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:
>>>
>>>> Cleanup is fast and cheap and basically a no-op if you haven’t changed
>>>> the ring
>>>>
>>>> After cassandra has transactional cluster metadata to make ring changes
>>>> strongly consistent, cassandra should do this in every compaction. But
>>>> until then it’s left for operators to run when they’re sure the state of
>>>> the ring is correct .
>>>>
>>>>
>>>>
>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <
>>>> chovatia.jayd...@gmail.com> wrote:
>>>>
>>>> 
>>>> Isn't this considered a kind of *bug* in Cassandra because as we know
>>>> *cleanup* is a lengthy and unreliable operation, so relying on the cleanup
>>>> means higher chances of data resurrection?
>>>> Do you think we should discard the unowned token-ranges as part of the
>>>> regular compaction itself? What are the pitfalls of doing this as part of
>>>> compaction itself?
>>>>
>>>> Jaydeep

Re: Is cleanup is required if cluster topology changes

2023-05-04 Thread Jeff Jirsa
You should 100% trigger cleanup each time or you'll almost certainly resurrect
data sooner or later.

If you're using leveled compaction it's especially cheap. Stcs and twcs are
worse, but if you're really scaling that often, I'd be considering lcs and
running cleanup just before or just after each scaling.

> On May 4, 2023, at 9:25 PM, Jaydeep Chovatia  wrote:
>
> Thanks, Jeff!
> But in our environment we replace nodes quite often for various optimization
> purposes, etc. say, almost 1 node per day (node addition followed by node
> decommission, which of course changes the topology), and we have a cluster
> of size 100 nodes with 300GB per node. If we have to run cleanup on 100
> nodes after every replacement, then it could take forever.
> What is the recommendation until we get this fixed in Cassandra itself as
> part of compaction (w/o externally triggering cleanup)?
>
> Jaydeep
>
> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Cleanup is fast and cheap and basically a no-op if you haven't changed the
>> ring.
>>
>> After cassandra has transactional cluster metadata to make ring changes
>> strongly consistent, cassandra should do this in every compaction. But
>> until then it's left for operators to run when they're sure the state of
>> the ring is correct.
>>
>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>
>>> Isn't this considered a kind of bug in Cassandra because as we know
>>> cleanup is a lengthy and unreliable operation, so relying on the cleanup
>>> means higher chances of data resurrection?
>>> Do you think we should discard the unowned token-ranges as part of the
>>> regular compaction itself? What are the pitfalls of doing this as part of
>>> compaction itself?
>>>
>>> Jaydeep
>>>
>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>
>>>> Compaction will just merge duplicate data and remove deleted data in
>>>> this node. If you add or remove one node for the cluster, I think clean
>>>> up is needed. If clean up failed, I think we should come to see the
>>>> reason.
>>>>
>>>> On Fri, May 5, 2023 at 06:37, Runtian Liu <curly...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> Is cleanup the sole method to remove data that does not belong to a
>>>>> specific node? In a cluster, where nodes are added or decommissioned
>>>>> from time to time, failure to run cleanup may lead to data resurrection
>>>>> issues, as deleted data may remain on the node that lost ownership of
>>>>> certain partitions. Or is it true that normal compactions can also
>>>>> handle data removal for nodes that no longer have ownership of certain
>>>>> data?
>>>>>
>>>>> Thanks,
>>>>> Runtian




Re: Is cleanup is required if cluster topology changes

2023-05-04 Thread Jeff Jirsa
Cleanup is fast and cheap and basically a no-op if you haven't changed the
ring.

After cassandra has transactional cluster metadata to make ring changes
strongly consistent, cassandra should do this in every compaction. But until
then it's left for operators to run when they're sure the state of the ring
is correct.

> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia  wrote:
>
> Isn't this considered a kind of bug in Cassandra because as we know cleanup
> is a lengthy and unreliable operation, so relying on the cleanup means
> higher chances of data resurrection?
> Do you think we should discard the unowned token-ranges as part of the
> regular compaction itself? What are the pitfalls of doing this as part of
> compaction itself?
>
> Jaydeep
>
> On Thu, May 4, 2023 at 7:25 PM guo Maxwell  wrote:
>
>> Compaction will just merge duplicate data and remove deleted data in this
>> node. If you add or remove one node for the cluster, I think clean up is
>> needed. If clean up failed, I think we should come to see the reason.
>>
>> On Fri, May 5, 2023 at 06:37, Runtian Liu  wrote:
>>
>>> Hi all,
>>> Is cleanup the sole method to remove data that does not belong to a
>>> specific node? In a cluster, where nodes are added or decommissioned from
>>> time to time, failure to run cleanup may lead to data resurrection
>>> issues, as deleted data may remain on the node that lost ownership of
>>> certain partitions. Or is it true that normal compactions can also handle
>>> data removal for nodes that no longer have ownership of certain data?
>>>
>>> Thanks,
>>> Runtian



Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-12 Thread Jeff Jirsa
Are you always inserting into the same partition (with contention) or different 
?

Which version are you using ? 

The short tldr is that the failure modes of the existing paxos implementation 
(under contention, under latency, under cluster strain) can cause undefined 
states. I believe that a subsequent serial read will deterministically resolve 
the state (look at cassandra-12126), but that has a cost (both the extra 
operation and the code complexity)

The upcoming transactional rewrite will likely change this, but it’s still WIP 
(CEP-15)
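As a sketch of the serial-read follow-up mentioned above (table and predicate
are hypothetical): re-reading the same partition at SERIAL forces any
in-flight Paxos proposal for it to be committed or discarded before the result
is returned, so the outcome of the "unknown" CAS becomes observable:

-- the LWT whose result came back as unknown
UPDATE ks.t SET val = 'new' WHERE id = 1 IF val = 'old';

-- follow-up read of the same partition
CONSISTENCY SERIAL
SELECT val FROM ks.t WHERE id = 1;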




> On Apr 12, 2023, at 6:11 AM, Ralph Boehme  wrote:
> 
> On 4/11/23 21:14, Ralph Boehme wrote:
>>> On 4/11/23 19:53, Bowen Song via user wrote:
>>> That error message sounds like one of the nodes timed out in the paxos 
>>> propose stage.  You can check the system.log and gc.log and see if you can 
>>> find anything unusual in them, such as network errors, out of sync clocks 
>>> or long stop-the-world GC pauses.
>> hm, I'll check the logs, but I can reproduce this 100% on an idle test 
>> cluster just by running a simple test client that generates a smallish 
>> workload where just 2 processes on a single host hammer the Cassandra 
>> cluster with LWTs.
> 
> nothing in the logs really.
> 
>> Maybe LWTs are not meant to be used this way?
> 
> fwiw, this happens 100% within a few seconds with a worload where two clients 
> hammer with LWTs on a single row.
> 
> Thanks!
> -slow
> 


Re: When are sstables that were compacted deleted?

2023-04-04 Thread Jeff Jirsa
You will DEFINITELY not remove sstables obsoleted by compaction if they are
being streamed out to neighbors. It would also not surprise me if something
holding a background reference to one of the sstables in the oldest/older
compaction transaction logs caused the whole process to block waiting on the
tidier to clean that up.

Things that may hold references:
- Validation compactions (repair)
- Index build/rebuild
- Streaming (repair, bootstrap, move, decom)

If you have repairs running, you can try pausing/cancelling them and/or
stopping validation/index_build compactions.
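For example (nodetool stop takes the compaction type to cancel; this only
stops the in-flight operations, it does not prevent new ones from starting):

# cancel in-flight validation (repair) and index-build compactions that may be
# holding references to the obsolete sstables
nodetool stop VALIDATION
nodetool stop INDEX_BUILD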



On Tue, Apr 4, 2023 at 2:29 PM Gil Ganz  wrote:

> If it was just one instance I would just bounce it but the thing is this
> happens when we join/remove nodes, and we have a lot of servers with this
> issue (while before the join/remove we are on ~50% disk usage).
> We found ourselves fighting with compaction to make sure we don't run out
> of space.
> Will open a ticket, thanks.
>
>
> On Wed, Apr 5, 2023 at 12:10 AM Jeff Jirsa  wrote:
>
>> If you restart the node, it'll process/purge those compaction logs on
>> startup, but you want them to purge/process now.
>>
>> I genuinely don't know when the tidier runs, but it's likely the case that
>> you're too busy compacting to purge (though I don't know what exactly "too
>> busy" means).
>>
>> Since you're close to 95% disk full, bounce one instance at a time to
>> recover the space, but we probably need a JIRA to understand exactly what's
>> blocking the tidier from running.
>>
>>
>>
>> On Tue, Apr 4, 2023 at 1:55 PM Gil Ganz  wrote:
>>
>>> More information - from another node in the cluster
>>>
>>> I can see many txn files although I only have two compactions running.
>>> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l *txn*
>>> -rw-r--r-- 1 cassandra cassandra 613 Apr  4 05:26
>>> nb_txn_compaction_09e3aa40-d2a7-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 461 Apr  4 10:17
>>> nb_txn_compaction_11433360-d265-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:48
>>> nb_txn_compaction_593e5320-d265-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 614 Apr  3 22:47
>>> nb_txn_compaction_701d62d0-d264-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 136 Apr  3 22:27
>>> nb_txn_compaction_bb770b50-d26e-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:23
>>> nb_txn_compaction_ce51bfe0-d264-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 134 Apr  4 10:31
>>> nb_txn_compaction_d17c7380-d2d3-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 464 Apr  4 09:24
>>> nb_txn_compaction_ed7fc650-d264-11ed-b76b-3b279f6334bc.log
>>> -rw-r--r-- 1 cassandra cassandra 613 Apr  3 22:54
>>> nb_txn_compaction_f456f3b0-d271-11ed-b76b-3b279f6334bc.log
>>>
>>> Let's take for example the one from "Apr  4 09:24"
>>> I can see the matching log message in system.log
>>>
>>> INFO  [CompactionExecutor:142085] 2023-04-04 09:24:29,892
>>> CompactionTask.java:241 - Compacted (ed7fc650-d264-11ed-b76b-3b279f6334bc)
>>> 2 sstables to
>>> [/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big,]
>>> to level=0.  362.987GiB to 336.323GiB (~92% of original) in 43,625,742ms.
>>> Read Throughput = 8.520MiB/s, Write Throughput = 7.894MiB/s, Row Throughput
>>> = ~-11,482/s.  3,755,353,838 total partitions merged to 3,479,484,261.
>>> Partition merge counts were {1:3203614684, 2:275869577, }
>>>
>>>
>>> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# cat
>>> nb_txn_compaction_ed7fc650-d264-11ed-b76b-3b279f6334bc.log
>>>
>>> ADD:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big-,0,8][2633027732]
>>>
>>> REMOVE:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-,1680543675071,8][4190572643]
>>>
>>> REMOVE:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12109-big-,1680554352704,8][3101929253]
>>> COMMIT:[,0,0][2613697770]
>>>
>>> I would expect sstable 10334 to be gone, since compaction finished an
>>> hour ago, but files are still there.
>>>
>>>
>>> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l
>>> /var/lib/cassandra/data/disk1/kt_ks/new_table-44

Re: When are sstables that were compacted deleted?

2023-04-04 Thread Jeff Jirsa
If you restart the node, it'll process/purge those compaction logs on
startup, but you want them to purge/process now.

I genuinely don't know when the tidier runs, but it's likely the case that
you're too busy compacting to purge (though I don't know what exactly "too
busy" means).

Since you're close to 95% disk full, bounce one instance at a time to
recover the space, but we probably need a JIRA to understand exactly what's
blocking the tidier from running.



On Tue, Apr 4, 2023 at 1:55 PM Gil Ganz  wrote:

> More information - from another node in the cluster
>
> I can see many txn files although I only have two compactions running.
> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l *txn*
> -rw-r--r-- 1 cassandra cassandra 613 Apr  4 05:26
> nb_txn_compaction_09e3aa40-d2a7-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 461 Apr  4 10:17
> nb_txn_compaction_11433360-d265-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:48
> nb_txn_compaction_593e5320-d265-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 614 Apr  3 22:47
> nb_txn_compaction_701d62d0-d264-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 136 Apr  3 22:27
> nb_txn_compaction_bb770b50-d26e-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:23
> nb_txn_compaction_ce51bfe0-d264-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 134 Apr  4 10:31
> nb_txn_compaction_d17c7380-d2d3-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 464 Apr  4 09:24
> nb_txn_compaction_ed7fc650-d264-11ed-b76b-3b279f6334bc.log
> -rw-r--r-- 1 cassandra cassandra 613 Apr  3 22:54
> nb_txn_compaction_f456f3b0-d271-11ed-b76b-3b279f6334bc.log
>
> Let's take for example the one from "Apr  4 09:24"
> I can see the matching log message in system.log
>
> INFO  [CompactionExecutor:142085] 2023-04-04 09:24:29,892
> CompactionTask.java:241 - Compacted (ed7fc650-d264-11ed-b76b-3b279f6334bc)
> 2 sstables to
> [/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big,]
> to level=0.  362.987GiB to 336.323GiB (~92% of original) in 43,625,742ms.
> Read Throughput = 8.520MiB/s, Write Throughput = 7.894MiB/s, Row Throughput
> = ~-11,482/s.  3,755,353,838 total partitions merged to 3,479,484,261.
> Partition merge counts were {1:3203614684, 2:275869577, }
>
>
> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# cat
> nb_txn_compaction_ed7fc650-d264-11ed-b76b-3b279f6334bc.log
>
> ADD:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big-,0,8][2633027732]
>
> REMOVE:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-,1680543675071,8][4190572643]
>
> REMOVE:[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12109-big-,1680554352704,8][3101929253]
> COMMIT:[,0,0][2613697770]
>
> I would expect sstable 10334 to be gone, since compaction finished an hour
> ago, but files are still there.
>
>
> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-*
> -rw-r--r-- 1 cassandra cassandra    315582480 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-CompressionInfo.db
> -rw-r--r-- 1 cassandra cassandra 361124597166 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Data.db
> -rw-r--r-- 1 cassandra cassandra   10 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Digest.crc32
> -rw-r--r-- 1 cassandra cassandra   4316334488 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Filter.db
> -rw-r--r-- 1 cassandra cassandra 283651305837 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Index.db
> -rw-r--r-- 1 cassandra cassandra        11934 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Statistics.db
> -rw-r--r-- 1 cassandra cassandra    763353245 Apr  3 17:41
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-Summary.db
> -rw-r--r-- 1 cassandra cassandra   92 Mar 24 16:46
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-10334-big-TOC.txt
> [user@server808 new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big-*
> -rw-r--r-- 1 cassandra cassandra    315582480 Apr  4 09:24
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big-CompressionInfo.db
> -rw-r--r-- 1 cassandra cassandra 361124597166 Apr  4 09:24
> /var/lib/cassandra/data/disk1/kt_ks/new_table-44263b40

Re: Reads not returning data after adding node

2023-04-03 Thread Jeff Jirsa
Because executing "removenode" streamed extra data from live nodes to the
"gaining" replica.

Oversimplified (if you had one token per node): if you start with A B C, then
add D, D should bootstrap a range from each of A, B and C, but at the end,
some of the data that was A B C becomes B C D. When you removenode, you tell
B and C to send data back to A. A, B and C will eventually compact that data
away. Eventually. If you get around to adding D again, running "cleanup" when
you're done (successfully) will remove a lot of it.

> On Apr 3, 2023, at 8:14 PM, David Tinker  wrote:
>
> Looks like the remove has sorted things out. Thanks.
>
> One thing I am wondering about is why the nodes are carrying a lot more
> data? The loads were about 2.7T before, now 3.4T.
>
> # nodetool status
> Datacenter: dc1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
> UN  xxx.xxx.xxx.105  3.4 TiB   256     100.0%            afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
> UN  xxx.xxx.xxx.253  3.34 TiB  256     100.0%            e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
> UN  xxx.xxx.xxx.107  3.44 TiB  256     100.0%            ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1
>
> On Mon, Apr 3, 2023 at 5:42 PM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> That's correct. nodetool removenode is strongly preferred when your node
>> is already down. If the node is still functional, use nodetool
>> decommission on the node instead.
>>
>> On 03/04/2023 16:32, Jeff Jirsa wrote:
>>
>>> FWIW, `nodetool decommission` is strongly preferred. `nodetool
>>> removenode` is designed to be run when a host is offline. Only
>>> decommission is guaranteed to maintain consistency / correctness, and
>>> removenode probably streams a lot more data around than decommission.
>>>
>>> On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> Use nodetool removenode is strongly preferred in most circumstances,
>>>> and only resort to assassinate if you do not care about data
>>>> consistency or you know there won't be any consistency issue (e.g. no
>>>> new writes and did not run nodetool cleanup).
>>>>
>>>> Since the size of data on the new node is small, nodetool removenode
>>>> should finish fairly quickly and bring your cluster back.
>>>>
>>>> Next time when you are doing something like this again, please test it
>>>> out on a non-production environment, make sure everything works as
>>>> expected before moving onto the production.
>>>>
>>>> On 03/04/2023 06:28, David Tinker wrote:
>>>>
>>>>> Should I use assassinate or removenode? Given that there is some data
>>>>> on the node. Or will that be found on the other nodes? Sorry for all
>>>>> the questions but I really don't want to mess up.
>>>>>
>>>>> On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
>>>>>
>>>>>> That's what nodetool assassinate will do.
>>>>>>
>>>>>> On Sun, Apr 2, 2023 at 10:19 PM David Tinker <david.tin...@gmail.com> wrote:
>>>>>>
>>>>>>> Is it possible for me to remove the node from the cluster i.e. to
>>>>>>> undo this mess and get the cluster operating again?
>>>>>>>
>>>>>>> On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
>>>>>>>
>>>>>>>> You can leave it in the seed list of the other nodes, just make
>>>>>>>> sure it's not included in this node's seed list.  However, if you
>>>>>>>> do decide to fix the issue with the racks first, assassinate this
>>>>>>>> node (nodetool assassinate ), and update the rack name before you
>>>>>>>> restart.
>>>>>>>>
>>>>>>>> On Sun, Apr 2, 2023 at 10:0

Re: Reads not returning data after adding node

2023-04-03 Thread Jeff Jirsa
FWIW, `nodetool decommission` is strongly preferred. `nodetool removenode`
is designed to be run when a host is offline. Only decommission is
guaranteed to maintain consistency / correctness, and removenode probably
streams a lot more data around than decommission.
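In command form (the host ID placeholder comes from `nodetool status`):

# node is still up and healthy: run this on the node that is leaving
nodetool decommission

# node is already dead or unrecoverable: run this from any live node
nodetool removenode <host-id>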


On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Use nodetool removenode is strongly preferred in most circumstances, and
> only resort to assassinate if you do not care about data consistency or
> you know there won't be any consistency issue (e.g. no new writes and did
> not run nodetool cleanup).
>
> Since the size of data on the new node is small, nodetool removenode
> should finish fairly quickly and bring your cluster back.
>
> Next time when you are doing something like this again, please test it out
> on a non-production environment, make sure everything works as expected
> before moving onto the production.
>
>
> On 03/04/2023 06:28, David Tinker wrote:
>
> Should I use assassinate or removenode? Given that there is some data on
> the node. Or will that be found on the other nodes? Sorry for all the
> questions but I really don't want to mess up.
>
> On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz  wrote:
>
>> That's what nodetool assassinate will do.
>>
>> On Sun, Apr 2, 2023 at 10:19 PM David Tinker 
>> wrote:
>>
>>> Is it possible for me to remove the node from the cluster i.e. to undo
>>> this mess and get the cluster operating again?
>>>
>>> On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz  wrote:
>>>
 You can leave it in the seed list of the other nodes, just make sure
 it's not included in this node's seed list.  However, if you do decide to
 fix the issue with the racks first assassinate this node (nodetool
 assassinate ), and update the rack name before you restart.

 On Sun, Apr 2, 2023 at 10:06 PM David Tinker 
 wrote:

> It is also in the seeds list for the other nodes. Should I remove it
> from those, restart them one at a time, then restart it?
>
> /etc/cassandra # grep -i bootstrap *
> doesn't show anything so I don't think I have auto_bootstrap false.
>
> Thanks very much for the help.
>
>
> On Mon, Apr 3, 2023 at 7:01 AM Carlos Diaz 
> wrote:
>
>> Just remove it from the seed list in the cassandra.yaml file and
>> restart the node.  Make sure that auto_bootstrap is set to true first
>> though.
>>
>> On Sun, Apr 2, 2023 at 9:59 PM David Tinker 
>> wrote:
>>
>>> So likely because I made it a seed node when I added it to the
>>> cluster it didn't do the bootstrap process. How can I recover this?
>>>
>>> On Mon, Apr 3, 2023 at 6:41 AM David Tinker 
>>> wrote:
>>>
 Yes replication factor is 3.

 I ran nodetool repair -pr on all the nodes (one at a time) and am
 still having issues getting data back from queries.

 I did make the new node a seed node.

 Re "rack4": I assumed that was just an indication as to the
 physical location of the server for redundancy. This one is separate 
 from
 the others so I used rack4.

 On Mon, Apr 3, 2023 at 6:30 AM Carlos Diaz 
 wrote:

> I'm assuming that your replication factor is 3.  If that's the
> case, did you intentionally put this node in rack 4?  Typically, you 
> want
> to add nodes in multiples of your replication factor in order to keep 
> the
> "racks" balanced.  In other words, this node should have been added 
> to rack
> 1, 2 or 3.
>
> Having said that, you should be able to easily fix your problem by
> running a nodetool repair -pr on the new node.
>
> On Sun, Apr 2, 2023 at 8:16 PM David Tinker <
> david.tin...@gmail.com> wrote:
>
>> Hi All
>>
>> I recently added a node to my 3 node Cassandra 4.0.5 cluster and
>> now many reads are not returning rows! What do I need to do to fix 
>> this?
>> There weren't any errors in the logs or other problems that I could 
>> see. I
>> expected the cluster to balance itself but this hasn't happened 
>> (yet?). The
>> nodes are similar so I have num_tokens=256 for each. I am using the
>> Murmur3Partitioner.
>>
>> # nodetool status
>> Datacenter: dc1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
>> UN  xxx.xxx.xxx.105  2.65 TiB   256     72.9%             afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
>> UN  xxx.xxx.xxx.253  2.6 TiB    256     73.9%             e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
>> UN  xxx.xxx.xxx.24   93.82 KiB  2

Re: Understanding rack in cassandra-rackdc.properties

2023-04-03 Thread Jeff Jirsa
As long as the number of racks is already at/above the number of nodes /
replication factor, it's gonna be fine.

Where it tends to surprise people is if you have RF=3 and either 1 or 2
racks, and then you add a third, that third rack gets one copy of "all" of
the data, so you often run out of disk space.

If you're already at 3 nodes / 3 racks / RF=3, you're already evenly
distributed, the next (4th, 5th, 6th) racks will just be randomly assigned
based on the random token allocation.



On Mon, Apr 3, 2023 at 8:12 AM David Tinker  wrote:

> I have a 3 node cluster using the GossipingPropertyFileSnitch and
> replication factor of 3. All nodes are leased hardware and more or less the
> same. The cassandra-rackdc.properties files look like this:
>
> dc=dc1
> rack=rack1
> (rack2 and rack3 for the other nodes)
>
> Now I need to expand the cluster. I was going to use rack4 for the next
> node, then rack5 and rack6 because the nodes are physically all on
> different racks. Elsewhere on this list someone mentioned that I should use
> rack1, rack2 and rack3 again.
>
> Why is that?
>
> Thanks
> David
>
>


Re: Reads not returning data after adding node

2023-04-02 Thread Jeff Jirsa
Just run nodetool rebuild on the new node. If you assassinate it now you
violate consistency for your most recent writes.

> On Apr 2, 2023, at 10:22 PM, Carlos Diaz  wrote:
>
> That's what nodetool assassinate will do.
>
> On Sun, Apr 2, 2023 at 10:19 PM David Tinker  wrote:
>
>> Is it possible for me to remove the node from the cluster i.e. to undo this
>> mess and get the cluster operating again?
>>
>> On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz  wrote:
>>
>>> You can leave it in the seed list of the other nodes, just make sure it's
>>> not included in this node's seed list.  However, if you do decide to fix
>>> the issue with the racks first, assassinate this node (nodetool
>>> assassinate ), and update the rack name before you restart.
>>>
>>> On Sun, Apr 2, 2023 at 10:06 PM David Tinker  wrote:
>>>
>>>> It is also in the seeds list for the other nodes. Should I remove it from
>>>> those, restart them one at a time, then restart it?
>>>>
>>>> /etc/cassandra # grep -i bootstrap *
>>>> doesn't show anything so I don't think I have auto_bootstrap false.
>>>>
>>>> Thanks very much for the help.
>>>>
>>>> On Mon, Apr 3, 2023 at 7:01 AM Carlos Diaz  wrote:
>>>>
>>>>> Just remove it from the seed list in the cassandra.yaml file and restart
>>>>> the node.  Make sure that auto_bootstrap is set to true first though.
>>>>>
>>>>> On Sun, Apr 2, 2023 at 9:59 PM David Tinker  wrote:
>>>>>
>>>>>> So likely because I made it a seed node when I added it to the cluster
>>>>>> it didn't do the bootstrap process. How can I recover this?
>>>>>>
>>>>>> On Mon, Apr 3, 2023 at 6:41 AM David Tinker  wrote:
>>>>>>
>>>>>>> Yes replication factor is 3.
>>>>>>>
>>>>>>> I ran nodetool repair -pr on all the nodes (one at a time) and am
>>>>>>> still having issues getting data back from queries.
>>>>>>>
>>>>>>> I did make the new node a seed node.
>>>>>>>
>>>>>>> Re "rack4": I assumed that was just an indication as to the physical
>>>>>>> location of the server for redundancy. This one is separate from the
>>>>>>> others so I used rack4.
>>>>>>>
>>>>>>> On Mon, Apr 3, 2023 at 6:30 AM Carlos Diaz  wrote:
>>>>>>>
>>>>>>>> I'm assuming that your replication factor is 3.  If that's the case,
>>>>>>>> did you intentionally put this node in rack 4?  Typically, you want
>>>>>>>> to add nodes in multiples of your replication factor in order to
>>>>>>>> keep the "racks" balanced.  In other words, this node should have
>>>>>>>> been added to rack 1, 2 or 3.
>>>>>>>>
>>>>>>>> Having said that, you should be able to easily fix your problem by
>>>>>>>> running a nodetool repair -pr on the new node.
>>>>>>>>
>>>>>>>> On Sun, Apr 2, 2023 at 8:16 PM David Tinker  wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> I recently added a node to my 3 node Cassandra 4.0.5 cluster and
>>>>>>>>> now many reads are not returning rows! What do I need to do to fix
>>>>>>>>> this? There weren't any errors in the logs or other problems that I
>>>>>>>>> could see. I expected the cluster to balance itself but this hasn't
>>>>>>>>> happened (yet?). The nodes are similar so I have num_tokens=256 for
>>>>>>>>> each. I am using the Murmur3Partitioner.
>>>>>>>>>
>>>>>>>>> # nodetool status
>>>>>>>>> Datacenter: dc1
>>>>>>>>> ===
>>>>>>>>> Status=Up/Down
>>>>>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>>>>>> --  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
>>>>>>>>> UN  xxx.xxx.xxx.105  2.65 TiB   256     72.9%             afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
>>>>>>>>> UN  xxx.xxx.xxx.253  2.6 TiB    256     73.9%             e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
>>>>>>>>> UN  xxx.xxx.xxx.24   93.82 KiB  256     80.0%             c4e8b4a0-f014-45e6-afb4-648aad4f8500  rack4
>>>>>>>>> UN  xxx.xxx.xxx.107  2.65 TiB   256     73.2%             ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1
>>>>>>>>>
>>>>>>>>> # nodetool netstats
>>>>>>>>> Mode: NORMAL
>>>>>>>>> Not sending any streams.
>>>>>>>>> Read Repair Statistics:
>>>>>>>>> Attempted: 0
>>>>>>>>> Mismatch (Blocking): 0
>>>>>>>>> Mismatch (Background): 0
>>>>>>>>> Pool Name            Active   Pending   Completed   Dropped
>>>>>>>>> Large messages       n/a      0         71754       0
>>>>>>>>> Small messages       n/a      0         8398184     14
>>>>>>>>> Gossip messages      n/a      0         1303634     0
>>>>>>>>>
>>>>>>>>> # nodetool ring
>>>>>>>>> Datacenter: dc1
>>>>>>>>> ==
>>>>>>>>> Address           Rack    Status  State   Load        Owns     Token
>>>>>>>>>                                                                9189523899826545641
>>>>>>>>> xxx.xxx.xxx.24    rack4   Up      Normal  93.82 KiB   79.95%   -9194674091837769168
>>>>>>>>> xxx.xxx.xxx.107   rack1   Up      Normal  2.65 TiB    73.25%   -9168781258594813088
>>>>>>>>> xxx.xxx.xxx.253   rack2   Up      Normal  2.6 TiB     73.92%   -9163037340977721917
>>>>>>>>> xxx.xxx.xxx.105   rack3   Up      Normal  2.65 TiB    72.88%   -9148860739730046229
>>>>>>>>> xxx.xxx.xxx.107   rack1   Up      Normal  2.65 TiB    73.25%   -9125240034139323535
>>>>>>>>> xxx.xxx.xxx.253   rack2   Up      Normal  2.6 TiB     73.92%   -9112518853051755414
>>>>>>>>> xxx.xxx.xxx.105   rack3   Up      Normal  2.65 TiB    72.88%   -9100516173422432134
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> This is causing a serious production issue. Please help if you can.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> David










Re: Reads not returning data after adding node

2023-04-02 Thread Jeff Jirsa
Looks like it joined with no data. Did you set auto_bootstrap to false? Or does 
the node think it’s a seed?

You want to use “nodetool rebuild” to stream data to that host. 

You can potentially end the production outage / incident by taking the host 
offline, or making it less likely to be queried (disable binary on that host 
and if you know how, use jmx to set severity to an arbitrarily high number) 
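A rough sketch of that sequence (the JMX severity step is omitted here, and
whether rebuild needs an explicit source-DC argument depends on your version,
so check `nodetool help rebuild`):

# on the under-populated node: stop accepting client (CQL) connections
nodetool disablebinary

# stream the data this node should own from the rest of the cluster
nodetool rebuild

# once it has rebuilt (and ideally been repaired), take client traffic again
nodetool enablebinary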

> On Apr 2, 2023, at 8:16 PM, David Tinker  wrote:
> 
> 
> Hi All
> 
> I recently added a node to my 3 node Cassandra 4.0.5 cluster and now many 
> reads are not returning rows! What do I need to do to fix this? There weren't 
> any errors in the logs or other problems that I could see. I expected the 
> cluster to balance itself but this hasn't happened (yet?). The nodes are 
> similar so I have num_tokens=256 for each. I am using the Murmur3Partitioner.
> 
> # nodetool status
> Datacenter: dc1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  xxx.xxx.xxx.105  2.65 TiB   256     72.9%             afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
> UN  xxx.xxx.xxx.253  2.6 TiB    256     73.9%             e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
> UN  xxx.xxx.xxx.24   93.82 KiB  256     80.0%             c4e8b4a0-f014-45e6-afb4-648aad4f8500  rack4
> UN  xxx.xxx.xxx.107  2.65 TiB   256     73.2%             ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1
> 
> # nodetool netstats
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name            Active   Pending   Completed   Dropped
> Large messages       n/a      0         71754       0
> Small messages       n/a      0         8398184     14
> Gossip messages      n/a      0         1303634     0
> 
> # nodetool ring
> Datacenter: dc1
> ==
> Address           Rack    Status  State   Load        Owns     Token
>                                                                9189523899826545641
> xxx.xxx.xxx.24    rack4   Up      Normal  93.82 KiB   79.95%   -9194674091837769168
> xxx.xxx.xxx.107   rack1   Up      Normal  2.65 TiB    73.25%   -9168781258594813088
> xxx.xxx.xxx.253   rack2   Up      Normal  2.6 TiB     73.92%   -9163037340977721917
> xxx.xxx.xxx.105   rack3   Up      Normal  2.65 TiB    72.88%   -9148860739730046229
> xxx.xxx.xxx.107   rack1   Up      Normal  2.65 TiB    73.25%   -9125240034139323535
> xxx.xxx.xxx.253   rack2   Up      Normal  2.6 TiB     73.92%   -9112518853051755414
> xxx.xxx.xxx.105   rack3   Up      Normal  2.65 TiB    72.88%   -9100516173422432134
> ...
> 
> This is causing a serious production issue. Please help if you can.
> 
> Thanks
> David
> 
> 
> 


Re: Nodetool command to pre-load the chunk cache

2023-03-21 Thread Jeff Jirsa
We serialize the other caches to disk to avoid cold-start problems, I don't
see why we couldn't also serialize the chunk cache? Seems worth a JIRA to
me.

Until then, you can probably use the dynamic snitch (badness + severity) to
route around newly started hosts.

I'm actually pretty surprised the chunk cache is that effective, sort of
nice to know.
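The relevant knobs live in cassandra.yaml (names as in the 3.x/4.0 yaml; the
values shown are the usual defaults, not a recommendation):

dynamic_snitch: true
# how much worse a replica's latency score must be before reads are routed
# away from it; lower values route around slow or cold nodes more aggressively
dynamic_snitch_badness_threshold: 0.1
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000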



On Tue, Mar 21, 2023 at 10:17 AM Carlos Diaz  wrote:

> Hi Team,
>
> We are heavy users of Cassandra at a pretty big bank.  Security measures
> require us to constantly refresh our C* nodes every x number of days.  We
> normally do this in a rolling fashion, taking one node down at a time and
> then refreshing it with a new instance.  This process has been working for
> us great for the past few years.
>
> However, we recently started having issues when a newly refreshed instance
> comes back online, our automation waits a few minutes for the node to
> become "ready (UN)" and then moves on to the next node.  The problem that
> we are facing is that when the node is ready, the chunk cache is still
> empty, so when the node starts accepting new connections, queries that go to
> it take much longer to respond and this causes errors for our apps.
>
> I was thinking that it would be great if we had a nodetool command that
> would allow us to prefetch a certain table or a set of tables to preload
> the chunk cache.  Then we could simply add another check (nodetool info?),
> to ensure that the chunk cache has been preloaded enough to handle queries
> to this particular node.
>
> Would love to hear others' feedback on the feasibility of this idea.
>
> Thanks!
>
>
>
>


Re: Cassandra in Kubernetes: IP switch decommission issue

2023-03-09 Thread Jeff Jirsa
I described something roughly similar to this a few years ago on the list.
The specific chain you're describing isn't one I've thought about before,
but if you open a JIRA for tracking and attribution, I'll ask some folks to
take a peek at it.



On Thu, Mar 9, 2023 at 10:57 AM Inès Potier  wrote:

> Hi Cassandra community,
>
> Reaching out again in case anyone has recently faced the below issue.
> Additional opinions on this would be super helpful for us.
>
> Thanks in advance,
> Ines
>
> On Thu, Feb 23, 2023 at 3:40 PM Inès Potier 
> wrote:
>
>> Hi Cassandra community,
>>
>> We have recently encountered a recurring old IP reappearance issue while
>> testing decommissions on some of our Kubernetes Cassandra staging clusters.
>> We have not yet found other references to this issue online. We could
>> really use some additional inputs/opinions, both on the problem itself and
>> the fix we are currently considering.
>>
>> *Issue Description*
>>
>> In Kubernetes, a Cassandra node can change IP at each pod bounce. We have
>> noticed that this behavior, associated with a decommission operation, can
>> get the cluster into an erroneous state.
>>
>> Consider the following situation: a Cassandra node node1 , with hostId1,
>> owning 20.5% of the token ring, bounces and switches IP (old_IP → new_IP).
>> After a couple gossip iterations, all other nodes’ nodetool status output
>> includes a new_IP UN entry owning 20.5% of the token ring and no old_IP
>> entry.
>>
>> Shortly after the bounce, node1 gets decommissioned. Our cluster does
>> not have a lot of data, and the decommission operation completes pretty
>> quickly. Logs on other nodes start showing acknowledgment that node1 has
>> left and soon, nodetool status’ new_IP UL entry disappears. node1 ‘s pod
>> is deleted.
>>
>> After a minute delay, the cluster enters the erroneous state. An  old_IP DN
>> entry reappears in nodetool status, owning 20.5% of the token ring. No node
>> owns this IP anymore and according to logs, old_IP is still associated
>> with hostId1.
>>
>> *Issue Root Cause*
>>
>> By digging through Cassandra logs, and re-testing this scenario over and
>> over again, we have reached the following conclusion:
>>
>>- Other nodes will continue exchanging gossip about old_IP , even
>>after it becomes a fatClient.
>>- The fatClient timeout and subsequent quarantine does not stop old_IP
>> from reappearing in a node’s Gossip state, once its quarantine is
>>over. We believe that this is due to a misalignment on all nodes’
>>old_IP expiration time.
>>- Once new_IP has left the cluster, and old_IP next gossip state
>>message is received by a node, StorageService will no longer face
>>collisions (or will, but with an even older IP) for hostId1 and its
>>corresponding tokens. As a result, old_IP will regain ownership of
>>20.5% of the token ring.
>>
>>
>> *Proposed fix*
>>
>> Following the above investigation, we were thinking about implementing
>> the following fix:
>>
>> When a node receives a gossip status change with STATE_LEFT for a
>> leaving endpoint new_IP, before evicting new_IP from the token ring,
>> purge from Gossip (ie evictFromMembership) all endpoints that meet the
>> following criteria:
>>
>>- endpointStateMap contains this endpoint
>>- The endpoint is not currently a token owner (
>>!tokenMetadata.isMember(endpoint))
>>- The endpoint’s hostId matches the hostId of new_IP
>>- The endpoint is older than leaving_IP (
>>Gossiper.instance.compareEndpointStartup)
>>- The endpoint’s token range (from endpointStateMap) intersects with
>>new_IP’s
>>
>> This modification’s intention is to force nodes to realign on old_IP 
>> expiration,
>> and expunge it from Gossip so it does not reappear after new_IP leaves
>> the ring.
>>
>>
>> Additional opinions/ideas regarding the fix’s viability and the issue
>> itself would be really helpful.
>> Thanks in advance,
>> Ines
>>
>


Re: Replacing node w/o bootstrapping (streaming)?

2023-02-09 Thread Jeff Jirsa
You don’t have to do anything else. Just use smart rsync flags (including 
delete).

It’ll work fine just the way you described. No special start args. No 
replacement flag 

Be sure you rsync the commitlog directory too. Flush and drain to be extra safe.
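A sketch of that sequence, run from the NEW node; the hostname and paths are
illustrative, so adjust them to your data_file_directories and commitlog_directory:

# initial sync while the OLD node is still serving traffic
rsync -avH --delete old-node:/var/lib/cassandra/data/       /var/lib/cassandra/data/

# on the OLD node: flush memtables, stop accepting writes, then stop cassandra
#   nodetool flush
#   nodetool drain

# final catch-up sync, including the commitlog directory
rsync -avH --delete old-node:/var/lib/cassandra/data/       /var/lib/cassandra/data/
rsync -avH --delete old-node:/var/lib/cassandra/commitlog/  /var/lib/cassandra/commitlog/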



> On Feb 9, 2023, at 6:42 PM, Max Campos  wrote:
> 
> Hi -
> 
> We have a node whose root partition is flaking out.  The disk that contains 
> the Cassandra data, however, is healthy.
> 
> We’d like to replace the dying node with a procedure like this:
> 
> 0) OLD node is running, NEW node has never started Cassandra
> 1) rsync Cassandra data from OLD node to NEW  (1.2TB)
> 2) Shutdown OLD node
> 3) rsync any remaining Cassandra data from OLD to NEW
> 4) Startup NEW node for the first time and have it take the place of the OLD 
> node in the cluster
> 
> The goal here is to eliminate bootstrapping (streaming), because it’s a 1 for 
> 1 swap and we can easily rsync all of the data over to the new node in 
> advance.
> 
> Questions: 
> 
> What do I need to do in step 4 to get Cassandra to take over the place of the 
> old node?
> 
> Is this a wise idea?  Or should I just bite the bullet and use 
> “-Dcassandra.replace_address” and do the bootstrapping (streaming)?  I have 
> no idea how long it takes to stream 1.2 TB of data.  
> 
> Our cluster:
> v3.0.23
> 2 DC
> 3 nodes per DC
> RF=3
> CL=LOCAL_QUORUM
> 
> Thanks everyone.
> 
> - Max


Re: Deletions getting omitted

2023-02-04 Thread Jeff Jirsa
While you'd expect only_purge_repaired_tombstones:true to be sufficient,
your gc_grace_seconds of 1 hour is making you unusually susceptible to
resurrecting data.

(To be clear, you should be safe to do this, but if there is a bug hiding
in there somewhere, your low gc_grace_seconds will make it likely to
resurrect; if this is causing you problems, I'd try raising that first to
mitigate while you investigate the real cause).

If it's CASSANDRA-15690, a second read at consistency ALL may cause the
data to properly show up "deleted" (you dont want to use ALL all the time,
because it'll be an outage if you ever have a node go down). Given
CASSANDRA-15690
exists, you probably want to upgrade.
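As a concrete sketch of both suggestions, using the table from the message
below (the uuid value is a placeholder): raise gc_grace_seconds back toward
the 10-day default, and do a one-off cqlsh read at ALL to check whether the
deletion is visible once every replica is consulted:

ALTER TABLE cycling.rider WITH gc_grace_seconds = 864000;

CONSISTENCY ALL
SELECT * FROM cycling.rider WHERE uuid = 'some-rider-id';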



On Sat, Feb 4, 2023 at 4:56 PM shankha b 
wrote:

> We are facing an issue on one of our production systems where after we
> delete the data
> the data doesn't seem to get deleted. We have a Get call just after the
> delete call.
> The data shows up.
>
> Versions
>
> cassandra : 3.11.6
> gocqlx : v2 v2.1.0
>
>
> 1. Client Settings: LocalQuorum
> 2. Number of Nodes : 3
> 3. All 3 nodes up and running for weeks.
> 4. Inserts were done few days earlier. So there is good amount of time
> difference
> between Inserts and Deletes and Inserts have made through successfully.
>
>
> The Delete Call :
>
> q := s.session.Query(stmt, names).BindStruct(*customModel)
> err := q.ExecRelease()
>
> We do check the error and it is Nil.
> There are no exceptions during that time either on the client side or
> server side.
>
>
> The Get Call :
>
> q := s.session.Query(stmt, names).BindStruct(*customModel)
> err := q.GetRelease(customModel)
>
> This returns the data successfully.
>
> We do have these two options enabled.
> 1.
> https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/config/configCassandra_yaml.html#configCassandra_yaml__commitlog_sync
>
> batch - Send ACK signal for writes after the commit log has been
> flushed to disk. Each incoming write triggers the flush task.
>
> 2. only_purge_repaired_tombstones
>
> This does not happen for all the delete operations. For many of them, the
> delete seems
> to go through. This does not seem to be timing-related and the successful
> and unsuccessful ones
> are spread out.
>
>
> CASSANDRA-15690
> Single partition queries can mistakenly omit partition deletions and
> resurrect data
>
> I am trying to go through this PR and ticket. If you have any suggestions,
> please do let me know.
>
>
> The table structure is the following
>
> CREATE KEYSPACE cycling WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}  AND durable_writes = true;
>
> CREATE TABLE cycling.rider (
> uuid text,
> created_at timestamp,
> PRIMARY KEY (uuid, created_at)
> ) WITH CLUSTERING ORDER BY (created_at DESC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4',
> 'only_purge_repaired_tombstones': 'true'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 3600
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
>
> Thanks
>
>
>


Re: removenode stuck - cassandra 4.1.0

2023-01-23 Thread Jeff Jirsa
Those hosts are likely sending streams.

If you do `nodetool netstats` on the replicas of the node you're removing,
you should see byte counters and file counters - they should all be
incrementing. If one of them isn't incrementing, that one is probably stuck.

There's at least one bug in 4.1 where (I think) the rate limiters can
interact in a way that causes this.
https://issues.apache.org/jira/browse/CASSANDRA-18110 describes it and has
a workaround.
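A quick way to keep an eye on those counters (the grep just hides files that
have already reached 100%, so anything that stays on screen without moving is
a candidate for the stuck stream):

# run on each replica listed in the removenode status output
watch -n 30 'nodetool netstats | grep -v 100%'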



On Mon, Jan 23, 2023 at 9:41 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> I had a drive fail (first drive in the list) on a Cassandra cluster.
> I've stopped the node (as it no longer starts), and am trying to remove
> it from the cluster, but the removenode command is hung (been running
> for 3 hours so far):
> nodetool removenode status is always reporting the same token as being
> removed.  Help?
>
> nodetool removenode status
> RemovalStatus: Removing token (-9196617215347134065). Waiting for
> replication confirmation from
> [/172.16.100.248,/172.16.100.249,/172.16.100.251,/172.16.100.252,/
> 172.16.100.34,/172.16.100.35,/172.16.100.36,/172.16.100.37,/172.16.100.38
> ,/172.16.100.42,/172.16.100.44,/172.16.100.45].
>
> Thanks.
>
> -Joe
>
>
>


Re: Failed disks - correct procedure

2023-01-16 Thread Jeff Jirsa
Prior to cassandra-6696 you’d have to treat one missing disk as a failed 
machine, wipe all the data and re-stream it, as a tombstone for a given value 
may be on one disk and data on another (effectively resurrecting deleted data)

So the answer has to be version dependent, too - which version were you using? 

> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy  wrote:
> 
> Hi Joe,
> 
> Reading it back I realized I misunderstood that part of your email, so
> you must be using data_file_directories with 16 drives?  That's a lot
> of drives!  I imagine this may happen from time to time given that
> disks like to fail.
> 
> That's a bit of an interesting scenario that I would have to think
> about.  If you brought the node up without the bad drive, repairs are
> probably going to do a ton of repair overstreaming if you aren't using
> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
> put things into a really bad state (lots of streaming = lots of
> compactions = slower reads) and you may be seeing some inconsistency
> if repairs weren't regularly running beforehand.
> 
> How much data was on the drive that failed?  How much data do you
> usually have per node?
> 
> Thanks,
> Andy
> 
>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>>  wrote:
>> 
>> Thank you Andy.
>> Is there a way to just remove the drive from the cluster and replace it
>> later?  Ordering replacement drives isn't a fast process...
>> What I've done so far is:
>> Stop node
>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>> Restart node
>> Run repair
>> 
>> Will that work?  Right now, it's showing all nodes as up.
>> 
>> -Joe
>> 
>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>> Hi Joe,
>>> 
>>> I'd recommend just doing a replacement, bringing up a new node with
>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>> described here:
>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>> 
>>> Before you do that, you will want to make sure a cycle of repairs has
>>> run on the replicas of the down node to ensure they are consistent
>>> with each other.
>>> 
>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the node
>>> you are replacing and that the initial_token matches the node you are
>>> replacing (If you are not using vnodes) so the node doesn't skip
>>> bootstrapping.  This is the default, but felt worth mentioning.
>>> 
>>> You can also remove the dead node, which should stream data to
>>> replicas that will pick up new ranges, but you also will want to do
>>> repairs ahead of time too.  To be honest it's not something I've done
>>> recently, so I'm not as confident on executing that procedure.
>>> 
>>> Thanks,
>>> Andy
>>> 
>>> 
>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>>  wrote:
 Hi all - what is the correct procedure when handling a failed disk?
 Have a node in a 15 node cluster.  This node has 16 drives and cassandra
 data is split across them.  One drive is failing.  Can I just remove it
 from the list and cassandra will then replicate? If not - what?
 Thank you!
 
 -Joe
 
 


Re: Cassandra 4.0.7 - issue - service not starting

2022-12-08 Thread Jeff Jirsa
What version of java are you using?
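For context, and as an assumption on my part about the likely cause given that
question: Cassandra 4.0 only runs on Java 8 or Java 11, and starting it on a
newer JDK can surface as reflection errors like the InvocationTargetException
in the log below. On RHEL you can check which JDK the service actually picks
up with:

java -version
alternatives --display java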


On Thu, Dec 8, 2022 at 8:07 AM Amit Patel via user <
user@cassandra.apache.org> wrote:

> Hi,
>
>
>
> I have installed cassandra-4.0.7-1.noarch  - repo ( baseurl=
> https://redhat.cassandra.apache.org/40x/noboolean/)  on Redhat 7.9.
>
>
>
> We have configured the below properties in cassandra.yaml
>
>
>
> Basic Parameters configured in  /etc/cassandra/conf/cassandra.yaml"
>
>
>
> cluster_name: 'CDBCluster'
>
> seed_provider:
>
>   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
>
> parameters:
>
>  - seeds: "10.33.1.200"
>
> listen_address: cdb-server01.local
>
> rpc_address: 10.33.1.200
>
> endpoint_snitch: GossipingPropertyFileSnitch
>
>
>
>
>
> when I start the Cassandra it fails to start.
>
>
>
> Below : system.log ( I don’t think it is related yaml issue but something
> else) .
>
>
>
> INFO  [main] 2022-12-08 13:20:57,923 YamlConfigurationLoader.java:97 -
> Configuration location: file:/etc/cassandra/default.conf/cassandra.yaml
>
> Exception (org.apache.cassandra.exceptions.ConfigurationException)
> encountered during startup: Invalid yaml:
> file:/etc/cassandra/default.conf/cassandra.yaml
>
> Error: Can't construct a java object for 
> tag:yaml.org,2002:org.apache.cassandra.config.Config;
> exception=java.lang.reflect.InvocationTargetException
>
> in 'reader', line 12, column 1:
>
> cluster_name: 'CDBCluster'
>
> ^
>
>
>
> Invalid yaml: file:/etc/cassandra/default.conf/cassandra.yaml
>
> Error: Can't construct a java object for 
> tag:yaml.org,2002:org.apache.cassandra.config.Config;
> exception=java.lang.reflect.InvocationTargetException
>
> in 'reader', line 12, column 1:
>
> cluster_name: 'CDBCluster'
>
> ^
>
>
>
> ERROR [main] 2022-12-08 13:20:58,185 CassandraDaemon.java:911 - Exception
> encountered during startup: Invalid yaml:
> file:/etc/cassandra/default.conf/cassandra.yaml
>
> Error: Can't construct a java object for 
> tag:yaml.org,2002:org.apache.cassandra.config.Config;
> exception=java.lang.reflect.InvocationTargetException
>
> in 'reader', line 12, column 1:
>
> cluster_name: 'CDBCluster'
>
> ^
>
>
>
>
>
> I would appreciate if you please advice what is wrong?  I have tried older
> version also 4.0.6 and 4.0.4 but same error even with default installation
> (config files) same issue.
>
>
>
> Thanks & Kind Regards,
>
> Amit Patel
>
>
>


Re: Unable to gossip with peers when starting cluster

2022-11-09 Thread Jeff Jirsa
When you say you configured them to talk to .0.31 as a seed, did you do
that by changing the yaml?

Was 0.9 ever a seed before?

I expect if you start 0.7 and 0.9 at the same time, it all works. This
looks like a logic/state bug that needs to be fixed, though.

(If you're going to upgrade, usually you start with all 3 hosts up, and
restart one at a time. Starting with 0 online is likely poorly tested, and
we should fix that).
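For what it's worth, a minimal seed_provider sketch with more than one seed
(addresses taken from the thread below); listing two or more seeds per DC is
the usual recommendation so a cold start doesn't hinge on a single host:

seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.0.31:7000,192.168.0.9:7000"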



On Wed, Nov 9, 2022 at 7:08 AM Klein, Benjamin E (PERATON) <
benjamin.e.kl...@peraton.com> wrote:

> I am trying to upgrade a three-node Cassandra cluster (192.168.0.31,
> 192.168.0.7, and 192.168.0.9) from 3.11 to 4.0.3. At the start of the
> process, all three nodes are down. I have configured all three nodes to
> have 192.168.0.31:7000 as their only seed.
>
> I am trying to bring all three nodes up, one at a time. Starting Node 1
> (.31) works just fine. However, Node 2 (.7) fails to start with the error
> message "Unable to gossip with any peers". The configuration file and log
> from Node 2 are attached (the log has had lines related to loading
> individual tables snipped); the relevant portion of the log is at the
> bottom of this message. Note that this node was able to successfully
> connect to the other seed node.
>
> I have already tried the following unsuccessfully:
>
> * Starting with a completely blank (i.e., newly formatted) /data drive on
> all nodes. This worked fine the first time the cluster started; however,
> attempting to restart the cluster gives the same error.
> * Ensuring that all clocks are synchronized to the same NTP servers, which
> have a ping time to all three nodes of approximately 0.5-1.0ms
> * Setting the cross_node_timeout configuration entry to false
> * Setting the internode_tcp_connect_timeout_in_ms configuration entry to
> 2
> * Adding an entry for each node in its /etc/hosts file (e.g., Node 1 gets
> the entry "192.168.0.31 node-1")
>
> Is there anything else I should try?
>
> ---
> Relevant portion of Cassandra log:
> INFO  [main] 2022-11-04 16:57:02,541 StorageService.java:755 - Loading
> persisted ring state
> INFO  [main] 2022-11-04 16:57:02,541 StorageService.java:838 - Populating
> token metadata from system tables
> INFO  [GossipStage:1] 2022-11-04 16:57:02,570 Gossiper.java:1969 - Adding /
> 192.168.0.31:7000 as there was no previous epState; new state is
> EndpointState: HeartBeatState = HeartBeat: generation = 0, version = -1,
> AppStateMap = {}
> INFO  [GossipStage:1] 2022-11-04 16:57:02,570 Gossiper.java:1969 - Adding /
> 192.168.0.9:7000 as there was no previous epState; new state is
> EndpointState: HeartBeatState = HeartBeat: generation = 0, version = -1,
> AppStateMap = {}
> INFO  [main] 2022-11-04 16:57:02,705 InboundConnectionInitiator.java:127 -
> Listening on address: (/192.168.0.7:7000), nic: eth0, encryption:
> unencrypted
> INFO  [Messaging-EventLoop-3-3] 2022-11-04 16:57:02,993
> OutboundConnection.java:1150 - /192.168.0.7:7000(/192.168.0.7:55882
> )->/192.168.0.31:7000-URGENT_MESSAGES-ef0bde62 successfully connected,
> version = 12, framing = CRC, encryption = unencrypted
> INFO  [Messaging-EventLoop-3-6] 2022-11-04 16:57:07,938
> NoSpamLogger.java:92 - 
> /192.168.0.7:7000->/192.168.0.9:7000-URGENT_MESSAGES-[no-channel]
> failed to connect
> io.netty.channel.AbstractChannel$AnnotatedConnectException:
> finishConnect(..) failed: Connection refused: /192.168.0.9:7000
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection
> refused
> at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
> at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
> at
> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
> at
> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
> at
> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
> at
> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Exception (java.lang.RuntimeException) encountered during startup: Unable
> to gossip with any peers
> java.lang.RuntimeException: Unable to gossip with any peers
> at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1844)
> at
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:650)
> at
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:936)
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:786)
> at

Re: concurrent sstable read

2022-10-25 Thread Jeff Jirsa
Sequentially, and yes - for some definition of "directly" - not just because
the reads are sequential, but also because each sstable has a cost to read
(e.g. JVM garbage created when you open/seek it that has to be collected after
the read).
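
If you want to see how many sstables your reads actually touch, the per-table histograms report it (keyspace/table names below are placeholders):

    nodetool tablehistograms my_keyspace my_table
    # the "SSTables" column shows sstables read per query at each percentile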

On Tue, Oct 25, 2022 at 8:27 AM Grzegorz Pietrusza 
wrote:

> HI all
>
> I can't find any information about how cassandra handles reads involving
> multiple sstables. Are sstables read concurrently or sequentially? Is read
> latency directly connected to the number of opened sstables?
>
> Regards
> Grzegorz
>


Re: Doubts on multiple filter design in cassandra

2022-10-16 Thread Jeff Jirsa
The limit only bounds what you return, not what you scan.

> On Oct 3, 2022, at 10:56 AM, Regis Le Bretonnic  wrote:
>
> Hi...
>
> We do the same (even if a lot of people will say it's bad and that you
> shouldn't...) with "allow filtering", BUT ALWAYS WITHIN A PARTITION AND WITH A
> LIMIT CLAUSE TO AVOID A FULL PARTITION SCAN.
>
> So you need to know the organisation_id and the product_type... and paginate
> your result with "product_name > X LIMIT 20", where X is the last product_name
> returned on the previous page (LIMIT is applied after the WHERE clause and
> defines the number of rows returned, not the number of rows scanned).
>
> This works fine within a partition, but if you don't have (organisation_id,
> product_type), don't do it and have a look at a secondary index.
>
> On Mon, Oct 3, 2022 at 18:26, Karthik K  wrote:
>
>> We have a table designed to retrieve products by name in ascending order.
>> OrganisationID and ProductType will be the compound partition key, whereas
>> ProductName will be the clustering key. So, the primary key structure is
>> ((organisation_id, product_type), product_name) with clustering order by
>> (product_name asc). All have text as a datatype.
>>
>> We have 20-30 other attributes relevant to the product stored in other
>> columns, out of which some 5 attributes are significant. For instance, those
>> attributes can be description, colour, city, size and date_of_manufacturing.
>> All the above attributes are of text datatype except for
>> date_of_manufacturing, which is a timestamp.
>>
>> Let's say a user wants to filter products based on all these 5 attributes.
>> Can this be done using Cassandra? Though we know that this can be achieved
>> using Elasticsearch on top of Cassandra, our constraint is to use Cassandra
>> alone and achieve this. Storing data across many tables is allowed.
>>
>> Note: At any instant, only 20 products can be listed on the page, which means
>> after applying all filters, we must display only 20 products.
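
A rough sketch of the within-partition pagination Regis describes, using the schema from the quoted message (the keyspace name, values, and the extra colour filter are placeholders):

    cqlsh -e "SELECT product_name, colour, city, size
              FROM my_ks.products
              WHERE organisation_id = 'org-1'
                AND product_type = 'shoes'
                AND product_name > 'last_product_from_previous_page'
                AND colour = 'red'
              LIMIT 20
              ALLOW FILTERING;"

Because the full partition key and a clustering-key bound are specified, the ALLOW FILTERING scan stays inside that one partition; the LIMIT caps what comes back, not what gets scanned.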



Re: TWCS recommendation on number of windows

2022-09-28 Thread Jeff Jirsa
So when I wrote TWCS, I wrote it for a use case that had 24h TTLs and 30
days of retention. In that application, we had tested 12h windows, 24h
windows, and 7 day windows, and eventually settled on 24h windows because
that balanced factors like sstable size, sstables-per-read, and expired
data waiting to be dropped (about 3%, 1/30th, on any given day). That's
where that recommendation came from - it was mostly around how much expired
data will sit around waiting to be dropped. That doesn't change with
multiple data directories.

If you go with fewer windows, you'll expire larger chunks at a time, which
means you'll retain larger chunks waiting on expiration.
If you go with more windows, you'll potentially touch more sstables on read.

Realistically, if you can model your data to align with chunks (so each
read only touches one window), the actual number of sstables shouldn't
really matter much - the timestamps and bloom filter will avoid touching
most of them on the read path anyway. If your data model doesn't have a
timestamp component to it and you're touching lots of sstables on read,
even 30 sstables is probably going to hurt you, and 210 would be really,
really bad.
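
For reference, the 24h-window setup described above looks roughly like this (keyspace/table names are placeholders):

    cqlsh -e "ALTER TABLE my_ks.events WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'};"

With 30 days of retention and 1-day windows that works out to roughly 30 windows on disk, which is in line with the 20-30 guidance being asked about.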





On Wed, Sep 28, 2022 at 7:00 AM Grzegorz Pietrusza 
wrote:

> Hi All!
>
> According to TWCS documentation (
> https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html)
> the operator should choose compaction window parameters to select a
> compaction_window_unit and compaction_window_size pair that produces
> approximately 20-30 windows.
>
> I'm curious where this recommendation comes from? Also should the number
> of windows be changed when more than one data directory is used? In my
> example there are 7 data directories (partitions) and it seems that all of
> them store 20-30 windows. Effectively this gives 140-210 sstables in total.
> Is that an optimal configuration?
>
> Running on Cassandra 3.11
>
> Regards
> Grzegorz
>


Re: Cassandra GC tuning

2022-09-20 Thread Jeff Jirsa
Beyond this there are two decent tuning sets, but they're relatively dated at this point.

CASSANDRA-8150 proposed a number of changes to defaults based on how it had been tuned at a specific large (competent) user:
https://issues.apache.org/jira/browse/CASSANDRA-8150

Amy Tobey wrote this guide around the 2.0/2.1 era, so it assumes things like jdk8 / CMS, but it still has more rigor than most other guides you'll find elsewhere and may help identify what's going on even if the specific tuning isn't super relevant in all cases:
Amy's Cassandra 2.1 tuning guide - Amy Writes (tobert.github.io)

On Sep 20, 2022, at 5:27 AM, Michail Kotsiouros via user  wrote:







Hello community,
BTW I am using Cassandra 3.11.4. From your comments, I understand that a CPU spike and maybe a long GC may be expected at the snapshot creation under specific circumstances. I will monitor the resources during snapshot creation. I will
 come back with more news.
 
Thanks a lot for your valuable input.
 
BR
MK

From: Jeff Jirsa  
Sent: Monday, September 19, 2022 20:06
To: user@cassandra.apache.org; Michail Kotsiouros 
Subject: Re: Cassandra GC tuning

 

https://issues.apache.org/jira/browse/CASSANDRA-13019 is in 4.0, you may find that tuning those thresholds 

 


On Mon, Sep 19, 2022 at 9:50 AM Jeff Jirsa <jji...@gmail.com> wrote:



Snapshots are probably actually caused by a spike in disk IO and disk latency, not GC (you'll see longer STW pauses as you get to a safepoint if that disk is hanging). This is especially problematic on SATA SSDs, or nVME SSDs with poor
 IO scheduler tuning.  There's a patch somewhere to throttle hardlinks to try to mitigate this. 

 


On Mon, Sep 19, 2022 at 3:45 AM Michail Kotsiouros via user <user@cassandra.apache.org> wrote:





Hello community,
I observe some GC pauses while trying to create snapshots of a keyspace. The GC pauses as such are not long, even though they are reported in logs. The problem is the CPU utilization
 which affects other applications deployed in my server.
Do you have any articles or recommendations about tuning GC in Cassandra?
 
Thank you in advance.
BR
MK












Re: Cassandra GC tuning

2022-09-19 Thread Jeff Jirsa
https://issues.apache.org/jira/browse/CASSANDRA-13019 is in 4.0, you may
find that tuning those thresholds

On Mon, Sep 19, 2022 at 9:50 AM Jeff Jirsa  wrote:

> Snapshots are probably actually caused by a spike in disk IO and disk
> latency, not GC (you'll see longer STW pauses as you get to a safepoint if
> that disk is hanging). This is especially problematic on SATA SSDs, or nVME
> SSDs with poor IO scheduler tuning.  There's a patch somewhere to throttle
> hardlinks to try to mitigate this.
>
> On Mon, Sep 19, 2022 at 3:45 AM Michail Kotsiouros via user <
> user@cassandra.apache.org> wrote:
>
>> Hello community,
>>
>> I observe some GC pauses while trying to create snapshots of a keyspace.
>> The GC pauses as such are not long, even though they are reported in logs.
>> The problem is the CPU utilization which affects other applications
>> deployed in my server.
>>
>> Do you have any articles or recommendations about tuning GC in Cassandra?
>>
>>
>>
>> Thank you in advance.
>>
>> BR
>>
>> MK
>>
>


Re: Cassandra GC tuning

2022-09-19 Thread Jeff Jirsa
Snapshots are probably actually caused by a spike in disk IO and disk
latency, not GC (you'll see longer STW pauses as you get to a safepoint if
that disk is hanging). This is especially problematic on SATA SSDs, or nVME
SSDs with poor IO scheduler tuning.  There's a patch somewhere to throttle
hardlinks to try to mitigate this.
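
A rough way to check whether the reported pauses line up with disk stalls during snapshot creation (the log path is the packaged default and may differ in your install):

    grep GCInspector /var/log/cassandra/system.log | tail -20
    iostat -x 5    # watch await/%util on the data disk while taking the snapshot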

On Mon, Sep 19, 2022 at 3:45 AM Michail Kotsiouros via user <
user@cassandra.apache.org> wrote:

> Hello community,
>
> I observe some GC pauses while trying to create snapshots of a keyspace.
> The GC pauses as such are not long, even though they are reported in logs.
> The problem is the CPU utilization which affects other applications
> deployed in my server.
>
> Do you have any articles or recommendations about tuning GC in Cassandra?
>
>
>
> Thank you in advance.
>
> BR
>
> MK
>


Re: Local reads metric

2022-09-17 Thread Jeff Jirsa
Yes

> On Sep 17, 2022, at 10:46 PM, Gil Ganz  wrote:
> 
> 
> Hey
> Are reads that come from a read repair somehow counted as part of the
> local read metric?
> i.e 
> org.apache.cassandra.metrics.Table... : 
> ReadLatency.1m_rate
> 
> Version is 4.0.4
> 
> Gil


Re: TimeWindowCompactionStrategy Operational Concerns

2022-09-15 Thread Jeff Jirsa
If you were able to generate the old data offline, using something like the
CQLSSTableWriter class, you could add it to the cluster (either via
streaming or nodetool import); that would maintain the TWCS invariant.

That said, with https://issues.apache.org/jira/browse/CASSANDRA-13418 , IF
you're comfortable with the safety guarantees there (you only use TTLs, you
don't explicitly issue deletes, etc), both of those concerns become much
less important.
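
A hedged sketch of opting into the CASSANDRA-13418 behavior; I'm quoting the option and system property names from memory, so verify them against your version before relying on this (keyspace/table are placeholders):

    # JVM flag, e.g. in jvm.options (name as I recall it):
    #   -Dcassandra.allow_unsafe_aggressive_sstable_expiration=true
    cqlsh -e "ALTER TABLE my_ks.events WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1',
        'unsafe_aggressive_sstable_expiration': 'true'};"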



On Thu, Sep 15, 2022 at 8:21 AM Michel Barret 
wrote:

> Hi,
>
> I want to use TWCS on a cassandra table. Documentation explain 2
> concerns about it:
>
> In case we mix old and new data "in the traditional write path".
>
> => How can I create another write path to ensure that my old data aren't
> in same sstable than new?
>
>
> If I query old data and that generate a repair.
>
> => How can I check this and/or ensure don't trigg repair with query?
>
>
> Thank you all for your help
>
> Have a nice day
>
>


Re: Bootstrap data streaming order

2022-09-12 Thread Jeff Jirsa
Sorry, I think the comment below is right, but there's some ambiguity, so
adding more words.

Each sending host will send each set of tables/keyspaces serially. So the
number of concurrent streams is capped by the number of hosts in the
cluster (not hosts * RF or hosts * tokens * RF, it's just one per host).

If you're using rack aware  (or in AWS, AZ-aware) snitches, it's also
influenced by the number of hosts in the rack.




On Mon, Sep 12, 2022 at 7:16 AM Jeff Jirsa  wrote:

> A new node joining will receive (replication factor) streams for each
> token it has. If you use single token and RF=3, three hosts will send data
> at the same time (the data sent is the “losing” replica of the data based
> on the next/new topology that will exist after the node finishes
> bootstrapping)
>
> The actual streaming plan is calculated by the joining host, and it
> executes the streams concurrently. This is one of the reasons vnodes were
> created - many more streaming sources so you can add/remove nodes much
> faster
>
> This is also true of moves and decommissions, but not replacements on the
> same token.
>
> For the display order - I’m not sure what order you’re talking about being
> ascending (IP? Perhaps?), but nodetool ring I think displays in token order.
>
>
>
> On Sep 12, 2022, at 12:05 AM, Marc Hoppins  wrote:
>
> 
>
> It doesn’t. However, I like to know things. Thus, I wanted to know what
> determines which nodes send their data in the order they do.
>
>
>
> Similarly, when the cluster was created, I added the seeds nodes in
> numerically ascending order and then the other nodes in a similar fashion.
> So why doesn’t nodetool display the status in that same order?
>
>
>
> *From:* Gil Ganz 
> *Sent:* Monday, September 12, 2022 8:50 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Bootstrap data streaming order
>
>
>
> EXTERNAL
>
> I can understand why the number of nodes sending at once might be
> interesting somehow, but why would the order of the nodes matter?
>
>
>
> On Fri, Sep 9, 2022 at 10:27 AM Marc Hoppins 
> wrote:
>
> Curiosity as to which data/node starts first, what determines the delivery
> sequence, how many nodes send data at once and what determines that limit?
>
> The usual kind of questions.
>
>
>
> -Original Message-
> From: Dinesh Joshi 
> Sent: Friday, September 9, 2022 9:14 AM
> To: user@cassandra.apache.org
> Subject: Re: Bootstrap data streaming order
>
> EXTERNAL
>
>
> The data is requested asynchronously from peers. There is some logic to
> select the peers however there isn’t a set order for data delivery. Why do
> you ask?
>
> >
> > On Sep 8, 2022, at 11:35 PM, Marc Hoppins  wrote:
> >
> > Hulloa all,
> >
> > Can anyone shed light on the order which nodes will deliver data to a
> new node?  Or point me toward a suitable chart/document?
> >
> > Does the new node accept data from each node in turn or simultaneously
> from multiple nodes?
> >
> > Thanks
> >
> > Marc
>
>


Re: Bootstrap data streaming order

2022-09-12 Thread Jeff Jirsa
A new node joining will receive (replication factor) streams for each token it 
has. If you use single token and RF=3, three hosts will send data at the same 
time (the data sent is the “losing” replica of the data based on the next/new 
topology that will exist after the node finishes bootstrapping)

The actual streaming plan is calculated by the joining host, and it executes the
streams concurrently. This is one of the reasons vnodes were created - many 
more streaming sources so you can add/remove nodes much faster

This is also true of moves and decommissions, but not replacements on the same 
token.

For the display order - I’m not sure what order you’re talking about being 
ascending (IP? Perhaps?), but nodetool ring I think displays in token order.



> On Sep 12, 2022, at 12:05 AM, Marc Hoppins  wrote:
> 
> 
> It doesn’t. However, I like to know things. Thus, I wanted to know what 
> determines which nodes send their data in the order they do.
>  
> Similarly, when the cluster was created, I added the seeds nodes in 
> numerically ascending order and then the other nodes in a similar fashion. So 
> why doesn’t nodetool display the status in that same order?
>  
> From: Gil Ganz  
> Sent: Monday, September 12, 2022 8:50 AM
> To: user@cassandra.apache.org
> Subject: Re: Bootstrap data streaming order
>  
> EXTERNAL
> 
> I can understand why the number of nodes sending at once might be interesting 
> somehow, but why would the order of the nodes matter?
>  
> On Fri, Sep 9, 2022 at 10:27 AM Marc Hoppins  wrote:
> Curiosity as to which data/node starts first, what determines the delivery 
> sequence, how many nodes send data at once and what determines that limit?
> 
> The usual kind of questions.
> 
> 
> 
> -Original Message-
> From: Dinesh Joshi  
> Sent: Friday, September 9, 2022 9:14 AM
> To: user@cassandra.apache.org
> Subject: Re: Bootstrap data streaming order
> 
> EXTERNAL
> 
> 
> The data is requested asynchronously from peers. There is some logic to 
> select the peers however there isn’t a set order for data delivery. Why do 
> you ask?
> 
> >
> > On Sep 8, 2022, at 11:35 PM, Marc Hoppins  wrote:
> >
> > Hulloa all,
> >
> > Can anyone shed light on the order which nodes will deliver data to a new 
> > node?  Or point me toward a suitable chart/document?
> >
> > Does the new node accept data from each node in turn or simultaneously from 
> > multiple nodes?
> >
> > Thanks
> >
> > Marc


Re: Adding nodes

2022-07-12 Thread Jeff Jirsa
Your rack awareness problem is described in
https://issues.apache.org/jira/browse/CASSANDRA-3810 from 2012.

The fundamental problem is that Cassandra wont move data except during
bootstrap, decom, and explicit moves.  The implication here is exactly what
you've encountered - if you tell cassandra to use racks, it's going to
distribute one replica onto each rack. To make rack awareness work, it has
to move that data on bootstrap, otherwise the first read will immediately
violate data placement rules and miss finding the data on read. When you
move the data on bootstrap, you have a state transition problem for which
nobody has proposed a workaround (because it's approximately very hard
given cassandra's architecture). If you want to use rack awareness, you
need to start with # of racks >= replication factor. Any other
configuration is moving from an invalid state to a valid state, and that
state transition is VERY bumpy.

Beyond that, your replication factors don't make sense (as others have
pointed out), and you don't have to pay to be told that; you can find free
doc content / youtube content that teaches you the same thing. I'm not a
datastax employee, but their dev rel team has a TON of free content on
youtube that does a very good job describing the tradeoffs.

For your actual problem, beyond the fact that you're streaming a copy of
all of the data in the cluster because of the 3810/rack count problem, the
following things are true:
- You'll almost certainly always stream from all the hosts in the cluster
because you're using vnodes, and this is one of the fundamental reasons
vnodes were introduced. By adding extra ranges to a node, you add extra
streaming sources. This is a feature to increase speed, but
- You're probably streaming too fast, causing GC pauses that's breaking
streaming and causing the joining node to drop from the cluster. I'm not
positive here, but if I had to guess based on all the other defaults I see,
it may be because it's using STCS and deserializing/reserializing every
data file rather than using the zero-copy streaming on LCS. This means your
throttle here is the stream throughput setting (via yaml or nodetool); set it so
it streams at a consistent rate without overrunning GC on the joining node
(example after this list)
- If it's not that, you're either seeing a bootstrap bug in 4.0 that I
haven't seen before (possible), or you're missing another log message
somewhere in the cluster, but it's not obvious exactly, I'd probably need
to see all of the logs and all of the gossipinfo from the cluster, but I'm
muting this thread after this email.
- Even if you fix the bootstrap thing, as Bowen pointed out, your
replication factor probably won't do what you want. It turns out 2 copies
in each of 2 DCs CAN be a valid replication factor, but it requires you
understand the visibility tradeoffs (if you write QUORUM, you have an
outage if either dc is down or the WAN is cut, if you write LOCAL_QUORUM,
you have an outage if any host goes down in the main DC). So if your goal
is to reclaim space from HDFS' RF=3 behavior, you're probably solving the
wrong problem.
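
To make that throttle concrete (the 200 Mb/s figure is just an illustration; the yaml key shown is the 4.0-era name):

    nodetool setstreamthroughput 200      # megabits per second, applies immediately
    nodetool getstreamthroughput
    # or persistently, in cassandra.yaml:
    #   stream_throughput_outbound_megabits_per_sec: 200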






On Tue, Jul 12, 2022 at 8:01 AM Marc Hoppins  wrote:

> I posted system log data, GC log data, debug log data, nodetool data.  I
> believe I had described the situation more than adequately. Yesterday, I
> was asking what I assumed to be reasonable questions regarding the method
> for adding new nodes to a new rack.
>
>
>
> Forgive me if it sounds unreasonable but I asked the same question again:
> your response regarding replication suggests that multiple racks in a
> datacentre is ALWAYS going to be the case when setting up a Cassandra
> cluster. Therefore, I can only assume that when setting up a new cluster
> there absolutely MUST be more than one rack.  The question I was asking
> yesterday regarding adding a new nodes in a new rack has never been
> adequately answered here and the only information I can find elsewhere
> clearly states that it is not recommended to add more than one new node at
> a time to maintain data/token consistency.
>
>
>
> So how is it possible to add new hardware when one-at-a-time will
> absolutely overload the first node added?  That seems like a reasonable,
> general question which anyone considering employing the software is going
> to ask.
>
>
>
> The reply to suggest that folk head off a pay for a course when there are
> ‘pre-sales’ questions is not a practical response as any business is
> unlikely to be spending speculative money.
>
>
>
> *From:* Jeff Jirsa 
> *Sent:* Tuesday, July 12, 2022 4:43 PM
> *To:* cassandra 
> *Cc:* Bowen Song 
> *Subject:* Re: Adding nodes
>
>
>
> EXTERNAL
>
>
>
>
>
> On Tue, Jul 12, 2022 at 7:27 AM Marc Hoppins 
> wrote:
>
>
>
> I was asking the questions but no one cared to answer.
>
>
>
> This is probably a combination of "it is really hard to answer a question
> with insufficient data" and your tone. Nobody here gets paid to help you
> solve your company's problems except you.
>
>
>
>
>
>
>
>
>


Re: Adding nodes

2022-07-12 Thread Jeff Jirsa
On Tue, Jul 12, 2022 at 7:27 AM Marc Hoppins  wrote:

>
> I was asking the questions but no one cared to answer.
>

This is probably a combination of "it is really hard to answer a question
with insufficient data" and your tone. Nobody here gets paid to help you
solve your company's problems except you.


Re: Adding nodes

2022-07-12 Thread Jeff Jirsa
 network connectivity or stuck in long STW GC pauses.
> Regardless of the reason behind it, the state shown on the joining node
> will remain as joining unless the steaming process has failed.
>
> The node state is propagated between nodes via gossip, and there may be a
> delay before all existing nodes agree on the fact that the joining node is
> no longer in the cluster. Within that delay, different nodes in the cluster
> may show different results in 'nodetool status'.
>
> You should check the logs on the existing nodes and the joining node to
> find out why is it happening, and make appropriate changes if needed.
>
> On 11/07/2022 09:23, Marc Hoppins wrote:
>
> Further oddities…
>
>
>
> I was sitting here watching our new new node being added (nodetool status
> being run from one of the seed nodes) and all was going well.  Then I
> noticed that our new new node was no longer visible.  I checked the service
> on the new new node and it was still running. So I checked status from this
> node and it shows in the status report (still UJ and streaming data), but
> takes a little longer to get the results than it did when it was visible
> from the seed.
>
>
>
> I checked status from a few different nodes in both datacentres (including
> other seeds) and the new new node shows up but from the original seed node,
> it does not appear in the nodetool status. Can anyone shed any light on
> this phenomena?
>
>
>
> *From:* Marc Hoppins  
> *Sent:* Monday, July 11, 2022 10:02 AM
> *To:* user@cassandra.apache.org
> *Cc:* Bowen Song  
> *Subject:* RE: Adding nodes
>
>
>
> Well then…
>
>
>
> I left this on Friday (still running) and came back to it today (Monday)
> to find the service stopped.  So, I blitzed this node from the ring and
> began anew with a different new node.
>
>
>
> I rather suspect the problem was with trying to use Ansible to add these
> initially - despite the fact that I had a serial limit of 1 and a pause of
> 90s for starting the service on each new node (based on the time taken when
> setting up this Cassandra cluster).
>
>
>
> So…moving forward…
>
>
>
> It is recommended to only add one new node at a time from what I read.
> This leads me to:
>
>
>
> Although I see the new node LOAD is progressing far faster than the
> previous failure, it is still going to take several hours to move from UJ
> to UN, which means I’ll be at this all week for the 12 new nodes. If our
> LOAD per node is around 400-600GB, is there any practical method to speed
> up adding multiple new nodes which is unlikely to cause problems?  After
> all, in the modern world of big (how big is big?) data, 600G per node is
> far less than the real BIG big-data.
>
>
>
> Marc
>
>
>
> *From:* Jeff Jirsa 
> *Sent:* Friday, July 8, 2022 5:46 PM
> *To:* cassandra 
> *Cc:* Bowen Song 
> *Subject:* Re: Adding nodes
>
>
>
> EXTERNAL
>
> Having a node UJ but not sending/receiving other streams is an invalid
> state (unless 4.0 moved the streaming data out of netstats? I'm not 100%
> sure, but I'm 99% sure it should be there).
>
>
>
> It likely stopped the bootstrap process long ago with an error (which you
> may not have seen), and is running without being in the ring, but also not
> trying to join the ring.
>
>
>
> 145GB vs 1.1T could be bits vs bytes (that's a factor of 8), or it could
> be that you streamed data and compacted it away. Hard to say, but less
> important - the fact that it's UJ but not streaming means there's a
> different problem.
>
>
>
> If it's me, I do this (not guaranteed to work, your mileage may vary, etc):
>
> 1) Look for errors in the logs of ALL hosts. In the joining host, look for
> an exception that stops bootstrap. In the others, look for messages about
> errors streaming, and/or exceptions around file access. In all of those
> hosts, check to see if any of them think they're streaming ( nodetool
> netstats again)
>
> 2) Stop the joining host. It's almost certainly not going to finish now.
> Remove data directories, commitlog directory, saved caches, hints. Wait 2
> minutes. Make sure every other host in the cluster sees it disappear from
> the ring. Then, start it fresh and let it bootstrap again. (you could
> alternatively try the resumable bootstrap option, but I never use it).
>
>
>
>
>
>
>
> On Fri, Jul 8, 2022 at 2:56 AM Marc Hoppins  wrote:
>
> Ifconfig shows RX of 1.1T. This doesn't seem to fit with the LOAD of
> 145GiB (nodetool status), unless I am reading that wrong...and the fact
> that this node still has a status of UJ.
>
> Netstats on this no

Re: Adding nodes

2022-07-08 Thread Jeff Jirsa
Having a node UJ but not sending/receiving other streams is an invalid
state (unless 4.0 moved the streaming data out of netstats? I'm not 100%
sure, but I'm 99% sure it should be there).

It likely stopped the bootstrap process long ago with an error (which you
may not have seen), and is running without being in the ring, but also not
trying to join the ring.

145GB vs 1.1T could be bits vs bytes (that's a factor of 8), or it could be
that you streamed data and compacted it away. Hard to say, but less
important - the fact that it's UJ but not streaming means there's a
different problem.

If it's me, I do this (not guaranteed to work, your mileage may vary, etc):
1) Look for errors in the logs of ALL hosts. In the joining host, look for
an exception that stops bootstrap. In the others, look for messages about
errors streaming, and/or exceptions around file access. In all of those
hosts, check to see if any of them think they're streaming ( nodetool
netstats again)
2) Stop the joining host. It's almost certainly not going to finish now.
Remove data directories, commitlog directory, saved caches, hints. Wait 2
minutes. Make sure every other host in the cluster sees it disappear from
the ring. Then, start it fresh and let it bootstrap again. (you could
alternatively try the resumable bootstrap option, but I never use it).
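
Step 2 above, sketched as commands; the directory locations are the packaged defaults and may differ in your install:

    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/* \
                /var/lib/cassandra/hints/*
    # wait ~2 minutes and confirm the node is gone from 'nodetool status'
    # on every other host, then:
    sudo systemctl start cassandra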



On Fri, Jul 8, 2022 at 2:56 AM Marc Hoppins  wrote:

> Ifconfig shows RX of 1.1T. This doesn't seem to fit with the LOAD of
> 145GiB (nodetool status), unless I am reading that wrong...and the fact
> that this node still has a status of UJ.
>
> Netstats on this node shows (other than :
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool NameActive   Pending  Completed   Dropped
> Large messages  n/a 0  0 0
> Small messages  n/a53  569755545  15740262
> Gossip messages n/a 0 288878 2
> None of this addresses the issue of not being able to add more nodes.
>
> -Original Message-
> From: Bowen Song via user 
> Sent: Friday, July 8, 2022 11:47 AM
> To: user@cassandra.apache.org
> Subject: Re: Adding nodes
>
> EXTERNAL
>
>
> I would assume that's 85 GB (i.e. gigabytes) then. Which is approximately
> 79 GiB (i.e. gibibytes). This still sounds awfully slow - less than 1MB/s
> over a full day (24 hours).
>
> You said CPU and network aren't the bottleneck. Have you checked the disk
> IO? Also, be mindful with CPU usage. It can still be a bottleneck if one
> thread uses 100% of a CPU core while all other cores are idle.
>
> On 08/07/2022 07:09, Marc Hoppins wrote:
> > Thank you for pointing that out.
> >
> > 85 gigabytes/gibibytes/GIGABYTES/GIBIBYTES/whatever name you care to
> > give it
> >
> > CPU and bandwidth are not the problem.
> >
> > Version 4.0.3 but, as I stated, all nodes use the same version so the
> version is not important either.
> >
> > Existing nodes have 350-400+(choose whatever you want to call a
> > gigabyte)
> >
> > The problem appears to be that adding new nodes is a serial process,
> which is fine when there is no data and each node is added within
> 2minutes.  It is hardly practical in production.
> >
> > -Original Message-
> > From: Bowen Song via user 
> > Sent: Thursday, July 7, 2022 8:43 PM
> > To: user@cassandra.apache.org
> > Subject: Re: Adding nodes
> >
> > EXTERNAL
> >
> >
> > 86Gb (that's gigabits, which is 10.75GB, gigabytes) took an entire day
> seems obviously too long. I would check the network bandwidth, disk IO and
> CPU usage and find out what is the bottleneck.
> >
> > On 07/07/2022 15:48, Marc Hoppins wrote:
> >> Hi all,
> >>
> >> Cluster of 2 DC and 24 nodes
> >>
> >> DC1 (RF3) = 12 nodes, 16 tokens each
> >> DC2 (RF3) = 12 nodes, 16 tokens each
> >>
> >> Adding 12 more nodes to DC1: I installed Cassandra (version is the same
> across all nodes) but, after the first node added, I couldn't seem to add
> any further nodes.
> >>
> >> I check nodetool status and the newly added node is UJ. It remains this
> way all day and only 86Gb of data is added to the node over the entire day
> (probably not yet complete).  This seems a little slow and, more than a
> little inconvenient to only be able to add one node at a time - or at least
> one node every 2 minutes.  When the cluster was created, I timed each node
> from service start to status UJ (having a UUID) and it was around 120
> seconds.  Of course there was no data.
> >>
> >> Is it possible I have some setting not correctly tuned?
> >>
> >> Thanks
> >>
> >> Marc
>


Re: Adding nodes

2022-07-07 Thread Jeff Jirsa
What version are you using?

When you run `nodetool netstats` on the joining node, what is the output?

How much data is there per node (presumably more than 86G)?



On Thu, Jul 7, 2022 at 7:49 AM Marc Hoppins  wrote:

> Hi all,
>
> Cluster of 2 DC and 24 nodes
>
> DC1 (RF3) = 12 nodes, 16 tokens each
> DC2 (RF3) = 12 nodes, 16 tokens each
>
> Adding 12 more nodes to DC1: I installed Cassandra (version is the same
> across all nodes) but, after the first node added, I couldn't seem to add
> any further nodes.
>
> I check nodetool status and the newly added node is UJ. It remains this
> way all day and only 86Gb of data is added to the node over the entire day
> (probably not yet complete).  This seems a little slow and, more than a
> little inconvenient to only be able to add one node at a time - or at least
> one node every 2 minutes.  When the cluster was created, I timed each node
> from service start to status UJ (having a UUID) and it was around 120
> seconds.  Of course there was no data.
>
> Is it possible I have some setting not correctly tuned?
>
> Thanks
>
> Marc
>


Re: Query around Data Modelling -2

2022-06-30 Thread Jeff Jirsa
How are you running repair? -pr? Or -st/-et?

4.0 gives you real incremental repair which helps. Splitting the table won’t 
make reads faster. It will increase the potential parallelization of 
compaction. 
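
For reference, the repair variants being asked about (keyspace/table names and tokens are placeholders):

    nodetool repair -pr my_ks                                            # only this node's primary ranges
    nodetool repair -st <start_token> -et <end_token> my_ks my_table     # one subrange
    nodetool repair --full my_ks                                         # force a full, non-incremental repair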

> On Jun 30, 2022, at 7:04 AM, MyWorld  wrote:
> 
> 
> Hi all,
> 
> Another query around data Modelling.
> 
> We have a existing table with below structure:
> Table(PK,CK, col1,col2, col3, col4,col5)
> 
> Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB to 
> 80MB. We have overall 100+ millions partitions. Also we have set levelled 
> compactions in place so as to get better read response time.
> 
> We are currently on 3.11.x version of Cassandra. On running a weekly repair 
> and compaction job, this model because of levelled compaction (occupied till 
> Level 3) consume heavy cpu resource and impact db performance.
> 
> Now what if we divide this table in 10 with each table containing 1/10 
> partitions. So now each table will be limited to levelled compaction upto 
> level-2. I think this would ease down read as well as compaction task.
> 
> What is your opinion on this?
> Even if we upgrade to ver 4.0, is the second model ok?
> 


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa
This is assuming each row is like … I dunno 10-1000 bytes. If you’re storing
like a huge 1MB blob, use two tables for sure.

> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa  wrote:
> 
> 
> 
> Ok so here’s how I would think about this
> 
> The writes don’t matter. (There’s a tiny tiny bit of nuance in one table 
> where you can contend adding to the memtable but the best cassandra engineers 
> on earth probably won’t notice that unless you have really super hot 
> partitions, so ignore the write path).
> 
> The reads are where it changes
> 
> In both models/cases, you’ll use the partition index to seek to where the 
> partition starts. 
> 
> In model 2 table 1 if you use ck+col1+… the read will load the column index 
> and use that to jump to within 64kb of the col1 value you specify 
> 
> In model 2 table 2, if you use ck+col3+…, same thing - column index can jump 
> to within 64k
> 
> What you give up in model one is the granularity of that jump. If you use 
> model 1 and col3 instead of col1, the read will have to scan the partition. 
> In your case, with 80 rows, that may still be within that 64kb block - you 
> may not get more granular than that anyway. And even if it’s slightly larger, 
> you’re probably going to be compressing 64k chunks - maybe you have to 
> decompress one extra chunk on read if your 1000 rows goes past 64k, but you 
> likely won’t actually notice. You’re technically asking the server to read 
> and skip data it doesn’t need to return - it’s not really the most efficient, 
> but at that partition size it’s noise. You could also just return all 80-100 
> rows, let the server do slightly less work and filter client side - also 
> valid, probably slightly worse than the server side filter. 
> 
> Having one table instead of two, though, probably saves you a ton of disk 
> space ($€£), and the lower disk space may also mean that data stays in page 
> cache, so the extra read may not even go to disk anyway.
> 
> So with your actual data shape, I imagine you won’t really notice the nominal 
> inefficiency of the first model, and I’d be inclined to do that until you 
> demonstrate it won’t work (I bet it works fine for a long long time). 
> 
>>> On Jun 22, 2022, at 7:11 PM, MyWorld  wrote:
>>> 
>> 
>> Hi Jeff,
>> Let me know how no of rows have an impact here.
>> May be today I have 80-100 rows per partition. But what if I started storing 
>> 2-4k rows per partition. However total partition size is still under 100 MB 
>> 
>>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:
>>> How many rows per partition in each model?
>>> 
>>> 
>>> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
>>> > 
>>> > 
>>> > Hi all,
>>> > 
>>> > Just a small query around data Modelling.
>>> > Suppose we have to design the data model for 2 different use cases which 
>>> > will query the data on same set of (partion+clustering key). So should we 
>>> > maintain a seperate table for each or a single table. 
>>> > 
>>> > Model1 - Combined table
>>> > Table(Pk,CK, col1,col2, col3, col4,col5)
>>> > 
>>> > Model2 - Seperate tables
>>> > Table1(Pk,CK,col1,col2,col3)
>>> > Table2(Pk,CK,col3,col4,col45)
>>> > 
>>> > So here partion and clustering keys are same. Also note column col3 is 
>>> > required in both use cases.
>>> > 
>>> > As per my thought in Model2, partition size would be less. There would be 
>>> > less sstables and when I use level compaction, it could be easily 
>>> > maintained. So should be better read performance.
>>> > 
>>> > Please help me to highlight the drawback and advantage of each data 
>>> > model. Here we have a mix kind of workload (read/write)


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa


Ok so here’s how I would think about this

The writes don’t matter. (There’s a tiny tiny bit of nuance in one table where 
you can contend adding to the memtable but the best cassandra engineers on 
earth probably won’t notice that unless you have really super hot partitions, 
so ignore the write path).

The reads are where it changes

In both models/cases, you’ll use the partition index to seek to where the 
partition starts. 

In model 2 table 1 if you use ck+col1+… the read will load the column index and 
use that to jump to within 64kb of the col1 value you specify 

In model 2 table 2, if you use ck+col3+…, same thing - column index can jump to 
within 64k

What you give up in model one is the granularity of that jump. If you use model 
1 and col3 instead of col1, the read will have to scan the partition. In your 
case, with 80 rows, that may still be within that 64kb block - you may not get 
more granular than that anyway. And even if it’s slightly larger, you’re 
probably going to be compressing 64k chunks - maybe you have to decompress one 
extra chunk on read if your 1000 rows goes past 64k, but you likely won’t 
actually notice. You’re technically asking the server to read and skip data it 
doesn’t need to return - it’s not really the most efficient, but at that 
partition size it’s noise. You could also just return all 80-100 rows, let the 
server do slightly less work and filter client side - also valid, probably 
slightly worse than the server side filter. 

Having one table instead of two, though, probably saves you a ton of disk space 
($€£), and the lower disk space may also mean that data stays in page cache, so 
the extra read may not even go to disk anyway.

So with your actual data shape, I imagine you won’t really notice the nominal 
inefficiency of the first model, and I’d be inclined to do that until you 
demonstrate it won’t work (I bet it works fine for a long long time). 
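
For concreteness, the two shapes being compared, written out with placeholder names and types (text everywhere), following the description in the quoted message:

    cqlsh -e "
      CREATE TABLE my_ks.combined (pk text, ck text, col1 text, col2 text,
          col3 text, col4 text, col5 text, PRIMARY KEY (pk, ck));      -- Model 1
      CREATE TABLE my_ks.t1 (pk text, ck text, col1 text, col2 text,
          col3 text, PRIMARY KEY (pk, ck));                            -- Model 2
      CREATE TABLE my_ks.t2 (pk text, ck text, col3 text, col4 text,
          col5 text, PRIMARY KEY (pk, ck));"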

> On Jun 22, 2022, at 7:11 PM, MyWorld  wrote:
> 
> 
> Hi Jeff,
> Let me know how no of rows have an impact here.
> May be today I have 80-100 rows per partition. But what if I started storing 
> 2-4k rows per partition. However total partition size is still under 100 MB 
> 
>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:
>> How many rows per partition in each model?
>> 
>> 
>> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
>> > 
>> > 
>> > Hi all,
>> > 
>> > Just a small query around data Modelling.
>> > Suppose we have to design the data model for 2 different use cases which 
>> > will query the data on same set of (partion+clustering key). So should we 
>> > maintain a seperate table for each or a single table. 
>> > 
>> > Model1 - Combined table
>> > Table(Pk,CK, col1,col2, col3, col4,col5)
>> > 
>> > Model2 - Seperate tables
>> > Table1(Pk,CK,col1,col2,col3)
>> > Table2(Pk,CK,col3,col4,col45)
>> > 
>> > So here partion and clustering keys are same. Also note column col3 is 
>> > required in both use cases.
>> > 
>> > As per my thought in Model2, partition size would be less. There would be 
>> > less sstables and when I use level compaction, it could be easily 
>> > maintained. So should be better read performance.
>> > 
>> > Please help me to highlight the drawback and advantage of each data model. 
>> > Here we have a mix kind of workload (read/write)


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa
How many rows per partition in each model?


> On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
> 
> 
> Hi all,
> 
> Just a small query around data Modelling.
> Suppose we have to design the data model for 2 different use cases which will 
> query the data on same set of (partion+clustering key). So should we maintain 
> a seperate table for each or a single table. 
> 
> Model1 - Combined table
> Table(Pk,CK, col1,col2, col3, col4,col5)
> 
> Model2 - Seperate tables
> Table1(Pk,CK,col1,col2,col3)
> Table2(Pk,CK,col3,col4,col45)
> 
> So here partion and clustering keys are same. Also note column col3 is 
> required in both use cases.
> 
> As per my thought in Model2, partition size would be less. There would be 
> less sstables and when I use level compaction, it could be easily maintained. 
> So should be better read performance.
> 
> Please help me to highlight the drawback and advantage of each data model. 
> Here we have a mix kind of workload (read/write)


Re: Configuration for new(expanding) cluster and new admins.

2022-06-20 Thread Jeff Jirsa
One of the advantages of faster streaming in 4.0+ is that it’s now very much 
viable to do this entirely with bootstraps and decoms in the same DC, when you 
have use cases where you can’t just change DC names

Vnodes will cause more compaction than single token, but you can just add in 
all the extra hosts (running cleanup after they’re in the cluster), allow them 
to be underutilized, and then decommission the old hosts 

In this order they’ll never have more load than they started with, it’s
strictly correct from a data visibility standpoint, and the bootstraps at the
beginning drop the load on everything very quickly, so the rest of the operations
are done at low CPU load relative to your starting point.

The decoms will cause some compaction, so don’t rush those. 
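
The sequence above, as commands (run in this order, not in parallel):

    # 1. bootstrap each new node normally (auto_bootstrap defaults to true)
    # 2. on every pre-existing node, once all the new nodes have joined:
    nodetool cleanup
    # 3. on each old node being retired, run locally, one at a time, unhurried:
    nodetool decommission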



> On Jun 20, 2022, at 9:45 AM, Elliott Sims  wrote:
> 
> 
> If the token value is the same across heterogenous nodes, it means that each 
> node gets a (roughly) equivalent amount of data and work to do.  So the 
> bigger servers would be under-utilized.
> 
> My answer so far to varied hardware getting out of hand is a periodic 
> hardware refresh and "datacenter" migration.  Stand up a logical "datacenter" 
> with all-new uniform denser hardware and a uniform vnode count (probably 16), 
> migrate to it, tear down the old hardware.
> 
>> On Thu, Jun 16, 2022 at 12:31 AM Marc Hoppins  wrote:
>> Thanks for that info.
>> 
>>  
>> 
>> I did see in the documentation that a value of 16 was not recommended for 
>> >50 hosts. Our existing hbase is 76 regionservers so I would imagine that 
>> (eventually) we will see a similar figure.
>> 
>>  
>> 
>> There will be some scenarios where an initial setup may have (eg) 2 x 8 HDD 
>> and future expansion adds either more HDD or newer nodes with larger 
>> storage.  It couldn’t be guaranteed that the storage would double but might 
>> increase by either less than 2x, or 3-4 x existing amount resulting in a 
>> heterogenous storage configuration.  In these cases how would it affect 
>> efficiency if the token figure were the same across all nodes?
>> 
>>  
>> 
>> From: Elliott Sims  
>> Sent: Thursday, June 16, 2022 12:24 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Configuration for new(expanding) cluster and new admins.
>> 
>>  
>> 
>> EXTERNAL
>> 
>> If you set a different num_tokens value for new hosts (the value should 
>> never be changed on an existing host), the amount of data moved to that host 
>> will be proportional to the num_tokens value.  So, if the new hosts are set 
>> to 32 when they're added to the cluster, those hosts will get twice as much 
>> data as the initial 16-token hosts.  
>> 
>> I think it's generally advised to keep a Cassandra cluster identical in 
>> terms of hardware and num_tokens, at least within a DC.  I suspect having a 
>> lot of different values would slow down Reaper significantly, but I've had 
>> decent results so far adding a few hosts with beefier hardware and 
>> num_tokens=32 to an existing 16-token cluster.
>> 
>>  
>> 
>> On Wed, Jun 15, 2022 at 1:33 AM Marc Hoppins  wrote:
>> 
>> Hi all,
>> 
>> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>> 
>> 4-core, 2 x HDD (eg, 4TiB)
>> 
>> num_tokens = 16 as a start point
>> 
>> If a plan is to gradually increase the nodes per DC, and new hardware will 
>> have more of everything, especially storage, I assume I increase the 
>> num_tokens value.  Should I have started with a lower value?
>> 
>> What would be considered as a good adjustment for:
>> 
>> Any increase in number of HDD for any node?
>> 
>> Any increase in capacity per HDD for any node?
>> 
>> Is there any direct correlation between new token count and the proportional 
>> increase in either quantity of devices or total capacity, or is any 
>> adjustment purely arbitrary just to differentiate between varied nodes?
>> 
>> Thanks
>> 
>> M
>> 
>> 
>> This email, including its contents and any attachment(s), may contain 
>> confidential and/or proprietary information and is solely for the review and 
>> use of the intended recipient(s). If you have received this email in error, 
>> please notify the sender and permanently delete this email, its content, and 
>> any attachment(s). Any disclosure, copying, or taking of any action in 
>> reliance on an email received in error is strictly prohibited.
>> 
> 
> This email, including its contents and any attachment(s), may contain 
> confidential and/or proprietary information and is solely for the review and 
> use of the intended recipient(s). If you have received this email in error, 
> please notify the sender and permanently delete this email, its content, and 
> any attachment(s).  Any disclosure, copying, or taking of any action in 
> reliance on an email received in error is strictly prohibited.


Re: Configuration for new(expanding) cluster and new admins.

2022-06-15 Thread Jeff Jirsa
You shouldn't need to change num_tokens at all.  num_tokens helps you
pretend your cluster is bigger than it is and randomly selects tokens for
you so that your data is approximately evenly distributed. As you add more
hosts, it should balance out automatically.

The alternative to num_tokens is to use a single token and explicitly
calculate it each time to ensure the cluster is properly balanced, and then
using `nodetool move` each time you add hosts to the cluster to
re-distribute load. num_tokens makes it less likely that you end up
imbalanced, so you shouldn't need to move any tokens manually.



On Wed, Jun 15, 2022 at 12:34 AM Marc Hoppins  wrote:

> Hi all,
>
> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>
> 4-core, 2 x HDD (eg, 4TiB)
>
> num_tokens = 16 as a start point
>
> If a plan is to gradually increase the nodes per DC, and new hardware will
> have more of everything, especially storage, I assume I increase the
> num_tokens value.  Should I have started with a lower value?
>
> What would be considered as a good adjustment for:
>
> Any increase in number of HDD for any node?
>
> Any increase in capacity per HDD for any node?
>
> Is there any direct correlation between new token count and the
> proportional increase in either quantity of devices or total capacity, or
> is any adjustment purely arbitrary just to differentiate between varied
> nodes?
>
> Thanks
>
> M
>


Re: Cassandra 3.0 upgrade

2022-06-13 Thread Jeff Jirsa
The versions with caveats should all be enumerated in
https://github.com/apache/cassandra/blob/cassandra-3.0/NEWS.txt

The biggest caveat was 3.0.14 (which had the fix for cassandra-13004),
which you're already on.

Personally, I'd qualify exactly one upgrade, and rather than doing 3
different upgrades, just do exactly one and spend 3 times as long proving
it's safe in non-production.



On Mon, Jun 13, 2022 at 10:17 PM Runtian Liu  wrote:

> Hi,
>
> I am running Cassandra version 3.0.14 at scale on thousands of nodes. I am
> planning to do a minor version upgrade from 3.0.14 to 3.0.26 in a safe
> manner. My eventual goal is to upgrade from 3.0.26 to a major release 4.0.
>
> As you know, there are multiple minor releases between 3.0.14 and 3.0.26,
> so I am planning to upgrade in 2-3 batches say 1) 3.0.14 → 3.0.16 2) 3.0.16
> to 3.0.20 3) 3.0.20 → 3.0.26.
>
> . Do you have suggestions or anything that I need to be aware of? Is there
> any minor release between 3.0.14 and 3.0.26, which is not safe etc.?
>
> Best regards.
>
>


Re: Gossip issues after upgrading to 4.0.4

2022-06-07 Thread Jeff Jirsa
This deserves a JIRA ticket please.

(I assume the sending host is randomly choosing the bad IP and blocking on
it for some period of time, causing other tasks to pile up, but it should
be investigated as a regression).



On Tue, Jun 7, 2022 at 7:52 AM Gil Ganz  wrote:

> Yes, I know the issue with the peers table, we had it in different
> clusters, in this case it appears the cause of the problem was indeed a bad
> ip in the seed list.
> After removing it from all nodes and reloading seeds, running a rolling
> restart does not cause any gossip issues, and in general the number of
> gossip pending tasks is 0 all the time, vs jumping to 2-5 pending tasks
> every once in a while before this change.
>
> Interesting that this bad ip didn't cause an issue in 3.11.9, I guess
> something in the way gossip works in c*4 made it so it caused a real issue
> after the upgrade.
>
> On Tue, Jun 7, 2022 at 12:04 PM Bowen Song  wrote:
>
>> Regarding the "ghost IP", you may want to check the system.peers_v2 table
>> by doing "select * from system.peers_v2 where peer = '123.456.789.012';"
>>
>> I've seen this (non-)issue many times, and I had to do "delete from
>> system.peers_v2 where peer=..." to fix it, as on our client side, the
>> Python cassandra-driver, reads the token ring information from this table
>> and uses it for routing requests.
>> On 07/06/2022 05:22, Gil Ganz wrote:
>>
>> Only errors I see in the logs prior to gossip pending issue are things
>> like this
>>
>> INFO  [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833
>> NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed
>> to connect
>> io.netty.channel.AbstractChannel$AnnotatedConnectException:
>> finishConnect(..) failed: No route to host: /Y:7000
>> Caused by: java.net.ConnectException: finishConnect(..) failed: No route
>> to host
>> at
>> io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
>> at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
>> at
>> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
>> at
>> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
>> at
>> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
>> at
>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
>> at
>> io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>> at
>> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> at
>> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Remote ip mentioned here is an ip that is appearing in the seed list
>> (there are 20 other valid ip addresses in the seed clause), but it's no
>> longer a valid ip, it's an old ip of an existing server (it's not in the
>> peers table). I will try to reproduce the issue with this this ip removed
>> from seed list
>>
>>
>> On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas 
>> wrote:
>>
>>> Hi Gil, thanks for reaching out.
>>>
>>> Can you check Cassandra's logs to see if any uncaught exceptions are
>>> being thrown? What you described suggests the possibility of an uncaught
>>> exception being thrown in the Gossiper thread, preventing further tasks
>>> from making progress; however I'm not aware of any open issues in 4.0.4
>>> that would result in this.
>>>
>>> Would be eager to investigate immediately if so.
>>>
>>> – Scott
>>>
>>> On Jun 6, 2022, at 11:04 AM, Gil Ganz  wrote:
>>>
>>>
>>> Hey
>>> We have a big cluster (>500 nodes, onprem, multiple datacenters, most
>>> with vnodes=32, but some with 128), that was recently upgraded from 3.11.9
>>> to 4.0.4. Servers are all centos 7.
>>>
>>> We have been dealing with a few issues related to gossip since :
>>> 1 - The moment the last node in the cluster was up with 4.0.4, and all
>>> nodes were in the same version, gossip pending tasks started to climb to
>>> very high numbers (>1M) in all nodes in the cluster, and quickly the
>>> cluster was practically down. Took us a few hours of stopping/starting up
>>> nodes, and adding more nodes to the seed list, to finally get the cluster
>>> back up.
>>> 2 - We notice that pending gossip tasks go up to very high
>>> numbers (50k), in random nodes in the cluster, without any meaningful event
>>> that happened and it doesn't look like it will go down on its own. After a
>>> few hours we restart those nodes and it goes back to 0.
>>> 3 - Doing a rolling restart to a list of servers is now an issue; more
>>> often than not, what will happen is that one of the nodes we restart comes up
>>> with gossip issues, and we need a 2nd restart to get the gossip pending
>>> task

Re: Malformed IPV6 address

2022-04-26 Thread Jeff Jirsa
Oof. From which version did you upgrade?

I would try:
>  export _JAVA_OPTIONS="-Djava.net.preferIPv4Stack=true"

There's a chance that fixes it (for an unpleasant reason).

Did you get a specific stack trace / log message at all? or just that
error?





On Tue, Apr 26, 2022 at 1:47 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi All - upgraded java recently
> (java-11-openjdk-11.0.15.0.9-2.el7_9.x86_64) , and now getting:
>
> nodetool: Failed to connect to '127.0.0.1:7199' - URISyntaxException:
> 'Malformed IPv6 address at index 7: rmi://[127.0.0.1]:7199'.
>
> whenever running nodetool.
> What am I missing?
>
> Thanks!
>
> -Joe
>
>
> --
> This email has been checked for viruses by AVG.
> https://www.avg.com
>
>


Re: about the performance of select * from tbl

2022-04-26 Thread Jeff Jirsa
Yes, you CAN change the fetch size to adjust how many rows are returned per
page. But if you have a million rows, you may still do hundreds or
thousands of queries, one after the next. Even if each is 1ms, it's going
to take a long time.

What Dor suggested is generating a number of SELECT statements, each of
which would return part of the table (using TOKEN()), that you can execute
in parallel. This will end up being much faster than trying to tune the
single SELECT.
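
For concreteness, here is a rough sketch of that token-range fan-out using the
Python cassandra-driver. The keyspace/table ("ks"/"tbl"), the partition key
column ("id"), the split count, and the contact point are all assumptions for
illustration, not details from this thread:

# A sketch, not production code: split the full Murmur3 token range into
# contiguous slices and run one bounded SELECT per slice, concurrently.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3Partitioner token range
SPLITS = 256                               # arbitrary; tune for your cluster

cluster = Cluster(["127.0.0.1"])           # hypothetical contact point
session = cluster.connect("ks")            # hypothetical keyspace
stmt = session.prepare(
    "SELECT * FROM tbl WHERE token(id) >= ? AND token(id) <= ?")

step = (MAX_TOKEN - MIN_TOKEN) // SPLITS
ranges, lo = [], MIN_TOKEN
for i in range(SPLITS):
    hi = MAX_TOKEN if i == SPLITS - 1 else lo + step - 1
    ranges.append((lo, hi))
    lo = hi + 1

# Each slice is a small query; the driver runs them in parallel.
for success, rows in execute_concurrent_with_args(
        session, stmt, ranges, concurrency=32):
    if success:
        for row in rows:
            pass  # process each row here

The driver's cluster metadata also exposes the actual token ring, which would
let you align the slices with replica ownership; the fixed arithmetic split
above is just the simplest version of the idea.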



On Tue, Apr 26, 2022 at 7:35 AM 18624049226 <18624049...@163.com> wrote:

> Thank you for your reply!
>
> What I want to know is that the data volume of this table is not massive.
> If the logic of CQL cannot be modified, just inside Cassandra, are there
> any parameters that can affect the behavior of this query? For example, the
> fetchSize parameter of other databases?
> 在 2022/4/26 21:18, Dor Laor 写道:
>
> select * reads all of the data from the cluster, obviously it would be bad
> if you'll
> run a single query and expect it to return 'fast'. The best way is to
> divide the data
> set into chunks which will be selected by the range ownership per node, so
> you'll
> be able to query in parallel the entire cluster and maximize the
> parallelism.
>
> If needed, I can provide an example for this
>
> On Tue, Apr 26, 2022 at 3:48 PM 18624049226 <18624049...@163.com> wrote:
>
>> We have a business scenario. We must execute the following statement:
>>
>> select * from tbl;
>>
>> This CQL has no WHERE condition.
>>
>> What I want to ask is that if the data in this table is more than one
>> million or more, what methods or parameters can improve the performance of
>> this CQL?
>>
>


Re: sstables changing in snapshots

2022-03-22 Thread Jeff Jirsa
The most useful thing that folks can provide is an indication of what was
writing to those data files when you were doing backups.

It's almost certainly one of:
- Memtable flush
- Compaction
- Streaming from repair/move/bootstrap

If you have logs that indicate compaction starting/finishing with those
sstables, or memtable flushing those sstables, or if the .log file is
included in your backup, pasting the contents of that .log file into a
ticket will make this much easier to debug.



On Tue, Mar 22, 2022 at 9:49 AM Yifan Cai  wrote:

> I do not think there is a ticket already. Feel free to create one.
> https://issues.apache.org/jira/projects/CASSANDRA/issues/
>
> It would be helpful to provide
> 1. The version of the cassandra
> 2. The options used for snapshotting
>
> - Yifan
>
> On Tue, Mar 22, 2022 at 9:41 AM Paul Chandler  wrote:
>
>> Hi all,
>>
>> Was there any further progress made on this? Did a Jira get created?
>>
>> I have been debugging our backup scripts and seem to have found the same
>> problem.
>>
>> As far as I can work out so far, it seems that this happens when a new
>> snapshot is created and the old snapshot is being tarred.
>>
>> I get a similar message:
>>
>> /bin/tar:
>> var/lib/cassandra/backup/keyspacename/tablename-4eec3b01aba811e896342351775ccc66/snapshots/csbackup_2022-03-22T14\\:04\\:05/nb-523601-big-Data.db:
>> file changed as we read it
>>
>> Thanks
>>
>> Paul
>>
>>
>>
>> On 19 Mar 2022, at 02:41, Dinesh Joshi  wrote:
>>
>> Do you have a repro that you can share with us? If so, please file a jira
>> and we'll take a look.
>>
>> On Mar 18, 2022, at 12:15 PM, James Brown  wrote:
>>
>> This in 4.0.3 after running nodetool snapshot that we're seeing sstables
>> change, yes.
>>
>> James Brown
>> Infrastructure Architect @ easypost.com
>>
>>
>> On 2022-03-18 at 12:06:00, Jeff Jirsa  wrote:
>>
>>> This is nodetool snapshot yes? 3.11 or 4.0?
>>>
>>> In versions prior to 3.0, sstables would be written with -tmp- in the
>>> name, then renamed when complete, so an sstable definitely never changed
>>> once it had the final file name. With the new transaction log mechanism, we
>>> use one name and a transaction log to note what's in flight and what's not,
>>> so if the snapshot system is including sstables being written (from flush,
>>> from compaction, or from streaming), those aren't final and should be
>>> skipped.
>>>
>>>
>>>
>>>
>>> On Fri, Mar 18, 2022 at 11:46 AM James Brown 
>>> wrote:
>>>
>>>> We use the boring combo of cassandra snapshots + tar to backup our
>>>> cassandra nodes; every once in a while, we'll notice tar failing with the
>>>> following:
>>>>
>>>> tar:
>>>> data/addresses/addresses-eb0196100b7d11ec852b1541747d640a/snapshots/backup20220318183708/nb-167-big-Data.db:
>>>> file changed as we read it
>>>>
>>>> I find this a bit perplexing; what would cause an sstable inside a
>>>> snapshot to change? The only thing I can think of is an incremental repair
>>>> changing the "repaired_at" flag on the sstable, but it seems like that
>>>> should "un-share" the hardlinked sstable rather than running the risk of
>>>> mutating a snapshot.
>>>>
>>>>
>>>> James Brown
>>>> Cassandra admin @ easypost.com
>>>>
>>>
>>
>>


Re: sstables changing in snapshots

2022-03-18 Thread Jeff Jirsa
This is nodetool snapshot yes? 3.11 or 4.0?

In versions prior to 3.0, sstables would be written with -tmp- in the name,
then renamed when complete, so an sstable definitely never changed once it
had the final file name. With the new transaction log mechanism, we use one
name and a transaction log to note what's in flight and what's not, so if
the snapshot system is including sstables being written (from flush, from
compaction, or from streaming), those aren't final and should be skipped.




On Fri, Mar 18, 2022 at 11:46 AM James Brown  wrote:

> We use the boring combo of cassandra snapshots + tar to backup our
> cassandra nodes; every once in a while, we'll notice tar failing with the
> following:
>
> tar:
> data/addresses/addresses-eb0196100b7d11ec852b1541747d640a/snapshots/backup20220318183708/nb-167-big-Data.db:
> file changed as we read it
>
> I find this a bit perplexing; what would cause an sstable inside a
> snapshot to change? The only thing I can think of is an incremental repair
> changing the "repaired_at" flag on the sstable, but it seems like that
> should "un-share" the hardlinked sstable rather than running the risk of
> mutating a snapshot.
>
>
> James Brown
> Cassandra admin @ easypost.com
>


Re: Gossips pending task increasing, nodes are DOWN

2022-03-17 Thread Jeff Jirsa
This release is from Sep 2016 (5.5 years ago) and has no fixes applied to
it since. There are likely MANY issues with that version.

On Thu, Mar 17, 2022 at 9:07 AM Jean Carlo 
wrote:

> Hello,
>
> After some restart, we go a list of nodes unreachable. These nodes are
> being seen as DOWN for the rest of the peers but they are running and keep
> accumulating gossip pending taks. Restarting the node does not solve the
> problem.
>
> The cassandra version is the 3.9 and the cluster has 80 nodes .
>
> Is there an issue with this version ?
>
> Jean Carlo
>
> "The best way to predict the future is to invent it" Alan Kay
>


Re: Cassandra Management tools?

2022-03-01 Thread Jeff Jirsa
Most teams are either using things like ansible/python scripts, or have
bespoke infrastructure.

Some of what you're describing is included in the intent of the
`cassandra-sidecar` project:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224


Goals
We target two main goals for the first version of the sidecar; both work
towards having an easy-to-use control plane for managing Cassandra’s data
plane.
- Provide an extensible and pluggable architecture for developers and
  operators to easily operate Cassandra as well as easing integration with
  their existing infrastructure. One major sub-goal of this goal is:
  - The proposal should pass the “curl test”: meaning that it is accessible to
    standard tooling and out of the box libraries available for practically
    every environment or programming language (including python, ruby, bash).
    This means that as a public interface we cannot choose Java specific (jmx)
    or Cassandra specific (CQL) APIs.
- Provide basic but essential and useful functionality. Some proposed scope
  in this document:
  - Run health checks on replicas and the cluster
  - Run diagnostic commands on individual nodes as well as all nodes in the
    cluster (bulk commands)
  - Export metrics via pluggable agents rather than polling JMX
  - Schedule periodic management activities such as running clean ups
  - (as a stretch goal) safely restart all nodes in the cluster.


The health checker seems to be implemented, I'm not sure if the coordinated
cleanup or similar exist yet (or if there are JIRAs around for them). In
theory, this type of work - outside the database, in automation - should be
really easy for newcomers who are solving their own problems.

Other things that sorta fall into this space, but may be not quite what
you're looking for:

- https://github.com/Netflix/Priam (if you run very much like netflix runs,
especially on AWS)
- https://github.com/thelastpickle/cassandra-reaper for the repair
automation
- https://github.com/JeremyGrosser/tablesnap (old-ish, for backups)



On Tue, Mar 1, 2022 at 11:05 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thanks all - I'll take a look at Ansible.  Back in my Hadoop days, we
> would use Cloudera manager (course that now costs $). Sounds like we
> need a new open source project!  :)
>
> -Joe
>
> On 3/1/2022 7:46 AM, Bowen Song wrote:
> > We use Ansible to manage a fairly large (200+ nodes) cluster. We
> > created our own Ansible playbooks for common tasks, such as rolling
> > restart. We also use Cassandra Reaper for scheduling and running
> > repairs on the same cluster. We occasionally also use pssh (parallel
> > SSH) for inspecting the logs or configurations on selected nodes.
> > Running pssh on very larger number of servers is obviously not
> > practical due the the available screen space constraint.
> >
> > On 28/02/2022 21:59, Joe Obernberger wrote:
> >> Hi all - curious what tools are folks using to manage large Cassandra
> >> clusters?  For example, to do tasks such as nodetool cleanup after a
> >> node or nodes are added to the cluster, or simply rolling start/stops
> >> after an update to the config or a new version?
> >> We've used puppet before; is that what other folks are using?
> >> Thanks for any suggestions.
> >>
> >> -Joe
> >>
> >
>


Re: [RELEASE] Apache Cassandra 4.0.2 released

2022-02-11 Thread Jeff Jirsa
We don't HAVE TO remove the Config.java entry - we can mark it as
deprecated and ignored and remove it in a future version (and you could
update Config.java to log a message about having a deprecated config
option). It's a much better operator experience: log for a major version,
then remove in the next.

On Fri, Feb 11, 2022 at 2:41 PM Ekaterina Dimitrova 
wrote:

> This had to be removed in 4.0 but it wasn’t. The patch mentioned did it to
> fix a bug that gave the impression those options still work. Confirmed with Benedict on the
> ticket.
>
> I agree I absolutely had to document it better, a ticket for documentation
> was opened but it slipped from my mind with this emergency release this
> week. It is unfortunate it is still in our backlog after the ADOC migration.
>
> Note taken. I truly apologize and I am going to prioritize
> CASSANDRA-17135. Let me know if there is anything else I can/should do at
> this point.
>
> On Fri, 11 Feb 2022 at 17:26, Erick Ramirez 
> wrote:
>
>> (moved dev@ to BCC)
>>
>>
>>> It looks like the otc_coalescing_strategy config key is no longer
>>> supported in cassandra.yaml in 4.0.2, despite this not being mentioned
>>> anywhere in CHANGES.txt or NEWS.txt.
>>>
>>
>> James, you're right -- it was removed by CASSANDRA-17132
>>  in 4.0.2 and 4.1.
>>
>> I agree that the CHANGES.txt entry should be clearer and we'll improve it
>> plus add detailed info in NEWS.txt. I'll get this done soon in
>> CASSANDRA-17135 .
>> Thanks for the feedback. Cheers!
>>
>


Re: [RELEASE] Apache Cassandra 4.0.2 released

2022-02-11 Thread Jeff Jirsa
Accidentally dropped dev@, so adding back in the dev list, with the hopes
that someone on the dev list helps address this.

On Fri, Feb 11, 2022 at 2:22 PM Jeff Jirsa  wrote:

> That looks like https://issues.apache.org/jira/browse/CASSANDRA-17132 +
> https://github.com/apache/cassandra/commit/b6f61e850c8cfb1f0763e0f15721cde8893814b5
>
> I suspect this needs to be reverted, at least in 4.0.x, and it definitely
> deserved a NEWS.txt entry (and ideally some period of deprecation/warning).
>
>
>
> On Fri, Feb 11, 2022 at 2:09 PM James Brown  wrote:
>
>> It looks like the otc_coalescing_strategy config key is no longer
>> supported in cassandra.yaml in 4.0.2, despite this not being mentioned
>> anywhere in CHANGES.txt or NEWS.txt.
>>
>> Attempting to upgrade a test cluster from 4.0.1 to 4.0.2 failed with 
>> *"Invalid
>> yaml. Please remove properties [otc_coalescing_strategy] from your
>> cassandra.yaml*"
>>
>> Is this change intentional?
>>
>> On Fri, Feb 11, 2022 at 1:40 AM Mick Semb Wever  wrote:
>>
>>> The Cassandra team is pleased to announce the release of Apache
>>> Cassandra version 4.0.2.
>>>
>>> Apache Cassandra is a fully distributed database. It is the right
>>> choice when you need scalability and high availability without
>>> compromising performance.
>>>
>>>  http://cassandra.apache.org/
>>>
>>> Downloads of source and binary distributions are listed in our download
>>> section:
>>>
>>>  http://cassandra.apache.org/download/
>>>
>>> This version is a bug fix release[1] on the 4.0 series. As always,
>>> please pay attention to the release notes[2] and Let us know[3] if you
>>> were to encounter any problem.
>>>
>>> Enjoy!
>>>
>>> [1]: CHANGES.txt
>>>
>>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.2
>>> [2]: NEWS.txt
>>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.2
>>> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>>>
>>
>>
>> --
>> James Brown
>> Engineer
>>
>


Re: [RELEASE] Apache Cassandra 4.0.2 released

2022-02-11 Thread Jeff Jirsa
That looks like https://issues.apache.org/jira/browse/CASSANDRA-17132 +
https://github.com/apache/cassandra/commit/b6f61e850c8cfb1f0763e0f15721cde8893814b5

I suspect this needs to be reverted, at least in 4.0.x, and it definitely
deserved a NEWS.txt entry (and ideally some period of deprecation/warning).



On Fri, Feb 11, 2022 at 2:09 PM James Brown  wrote:

> It looks like the otc_coalescing_strategy config key is no longer
> supported in cassandra.yaml in 4.0.2, despite this not being mentioned
> anywhere in CHANGES.txt or NEWS.txt.
>
> Attempting to upgrade a test cluster from 4.0.1 to 4.0.2 failed with *"Invalid
> yaml. Please remove properties [otc_coalescing_strategy] from your
> cassandra.yaml*"
>
> Is this change intentional?
>
> On Fri, Feb 11, 2022 at 1:40 AM Mick Semb Wever  wrote:
>
>> The Cassandra team is pleased to announce the release of Apache
>> Cassandra version 4.0.2.
>>
>> Apache Cassandra is a fully distributed database. It is the right
>> choice when you need scalability and high availability without
>> compromising performance.
>>
>>  http://cassandra.apache.org/
>>
>> Downloads of source and binary distributions are listed in our download
>> section:
>>
>>  http://cassandra.apache.org/download/
>>
>> This version is a bug fix release[1] on the 4.0 series. As always,
>> please pay attention to the release notes[2] and Let us know[3] if you
>> were to encounter any problem.
>>
>> Enjoy!
>>
>> [1]: CHANGES.txt
>>
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.2
>> [2]: NEWS.txt
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.2
>> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>>
>
>
> --
> James Brown
> Engineer
>


Re: Running enablefullquerylog crashes cassandra

2022-02-06 Thread Jeff Jirsa
That looks like a nodetool stack - can you check the Cassandra log for 
corresponding error? 

> On Feb 6, 2022, at 12:52 AM, Gil Ganz  wrote:
> 
> 
> Hey
> I'm trying to enable the full query log on a Cassandra 4.0.1 node and it's causing 
> Cassandra to shut down
> 
> nodetool enablefullquerylog --path /mnt/fql_data
> 
> Cassandra has shutdown.
> error: null
> -- StackTrace --
> java.io.EOFException
> at java.io.DataInputStream.readByte(DataInputStream.java:267)
> at 
> sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:222)
> at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
> at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
> at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown 
> Source)
> at 
> javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1020)
> at 
> javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:298)
> at com.sun.proxy.$Proxy6.enableFullQueryLogger(Unknown Source)
> at 
> org.apache.cassandra.tools.NodeProbe.enableFullQueryLogger(NodeProbe.java:1836)
> at 
> org.apache.cassandra.tools.nodetool.EnableFullQueryLog.execute(EnableFullQueryLog.java:62)
> at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
> at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
> at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
> at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:84)
> 
> /mnt/fql_data is owned by cassandra user
> Doesn't matter if directory is empty or not
> 
> contents of cassandra.yaml 
> 
> full_query_logging_options:
>  log_dir: /mnt/fql_data
>  roll_cycle: HOURLY
>  block: true
>  max_queue_weight: 268435456
>  max_log_size: 34359738368
> # archive command is "/path/to/script.sh %path" where %path is replaced 
> with the file being rolled:
> # archive_command:
> # max_archive_retries: 10
> 
> No errors in the log, last couple of lines are 
> 
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2022-02-06 08:41:50,334 
> BinLog.java:420 - Attempting to configure bin log: Path: /mnt/fql_data Roll 
> cycle: HOURLY Blocking: true Max queue weight: 268435456 Max log 
> size:34359738368 Archive command:
> INFO  [RMI TCP Connection(10)-127.0.0.1] 2022-02-06 08:41:50,335 
> BinLog.java:433 - Cleaning directory: /mnt/fql_data as requested
> 
> I noticed there is a similar bug 
> https://issues.apache.org/jira/browse/CASSANDRA-17136  but I also tried 
> setting disk_failure_policy to ignore, same thing.
> Has anyone encountered something similar? 
> 
> 
> 
> 
> 
> gil


Re: about memory problem in write heavy system..

2022-01-07 Thread Jeff Jirsa
3.11.4 is a very old release, with lots of known bugs. It's possible the
memory is related to that.

If you bounce one of the old nodes, where does the memory end up?


On Thu, Jan 6, 2022 at 3:44 PM Eunsu Kim  wrote:

>
> Looking at the memory usage chart, it seems that the physical memory usage
> of the existing node has increased since the new node was added with
> auto_bootstrap=false.
>
>
>
>
> On Fri, Jan 7, 2022 at 1:11 AM Eunsu Kim  wrote:
>
>> Hi,
>>
>> I have a Cassandra cluster(3.11.4) that does heavy writing work. (14k~16k
>> write throughput per second per node)
>>
>> Nodes are physical machines in a data center. The number of nodes is 30. Each
>> node has three data disks mounted.
>>
>>
>> A few days ago, a QueryTimeout problem occurred due to Full GC.
>> So, referring to this blog(
>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html), it seemed to
>> have been solved by changing the memtable_allocation_type to
>> offheap_objects.
>>
>> But today, I got an alarm saying that some nodes are using more than 90%
>> of physical memory. (115GiB /125GiB)
>>
>> Native memory usage of some nodes is gradually increasing.
>>
>>
>>
>> All tables use TWCS, and TTL is 2 weeks.
>>
>> Below is the applied jvm option.
>>
>> -Xms31g
>> -Xmx31g
>> -XX:+UseG1GC
>> -XX:G1RSetUpdatingPauseTimePercent=5
>> -XX:MaxGCPauseMillis=500
>> -XX:InitiatingHeapOccupancyPercent=70
>> -XX:ParallelGCThreads=24
>> -XX:ConcGCThreads=24
>> …
>>
>>
>> What additional things can I try?
>>
>> I am looking forward to the advice of experts.
>>
>> Regards.
>>
>
>


Re: Node failed after drive failed

2021-12-11 Thread Jeff Jirsa
Likely lost (enough of) the system keyspace on that disk, so the data files 
indicating the host was in the cluster are missing and the host tried to 
rebootstrap



> On Dec 11, 2021, at 12:47 PM, Bowen Song  wrote:
> 
> 
> Hi Joss,
> 
> To unsubscribe from this mailing list, please send an email to 
> user-unsubscr...@cassandra.apache.org, not the mailing list itself 
> (user@cassandra.apache.org).
> 
>> On 09/12/2021 16:14, Joss wrote:
>> unsubscribe
>> 
>> On Mon, 6 Dec 2021 at 14:12, Joe Obernberger  
>> wrote:
>>> Hi All - one node in an 11 node cluster experienced a drive failure on 
>>> the first drive in the list.  I removed that drive from the list so that 
>>> it now reads:
>>> 
>>> data_file_directories:
>>>  - /data/2/cassandra/data
>>>  - /data/3/cassandra/data
>>>  - /data/4/cassandra/data
>>>  - /data/5/cassandra/data
>>>  - /data/6/cassandra/data
>>>  - /data/8/cassandra/data
>>>  - /data/9/cassandra/data
>>> 
>>> But when I try to start the server, I get:
>>> 
>>> Exception (java.lang.RuntimeException) encountered during startup: A 
>>> node with address /172.16.100.251:7000 already exists, cancelling join. 
>>> Use cassandra.replace_address if you want to replace this node.
>>> java.lang.RuntimeException: A node with address /172.16.100.251:7000 
>>> already exists, cancelling join. Use cassandra.replace_address if you 
>>> want to replace this node.
>>>  at 
>>> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
>>>  at 
>>> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
>>>  at 
>>> org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
>>>  at 
>>> org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
>>> ERROR [main] 2021-12-05 15:49:48,446 CassandraDaemon.java:909 - 
>>> Exception encountered during startup
>>> java.lang.RuntimeException: A node with address /172.16.100.251:7000 
>>> already exists, cancelling join. Use cassandra.replace_address if you 
>>> want to replace this node.
>>>  at 
>>> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
>>>  at 
>>> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
>>>  at 
>>> org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
>>>  at 
>>> org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
>>>  at 
>>> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
>>> INFO  [StorageServiceShutdownHook] 2021-12-05 15:49:48,468 
>>> HintsService.java:220 - Paused hints dispatch
>>> WARN  [StorageServiceShutdownHook] 2021-12-05 15:49:48,470 
>>> Gossiper.java:1993 - No local state, state is in silent shutdown, or 
>>> node hasn't joined, not announcing shutdown
>>> 
>>> Do I need to remove and re-add the node?  When a drive fails with 
>>> cassandra, is it common for the node to come down?
>>> 
>>> Thank you!
>>> 
>>> -Joe Obernberger
>>> 


Re: Which source replica does rebuild stream from?

2021-11-25 Thread Jeff Jirsa

Using each and local consistencies here gives you some safety in the transient 
steps but also suggests you have control over when you move traffic 

Is all traffic going to the first DC while you add the second?
If so, set RF=3 and run repair before you move traffic

If you were using quorum instead of local, you’d:
- go from RF=0 to 1 in the new DC
- run rebuild, then run full or incremental repair (4.0+)
- go from rf=1 to 2, then rebuild then repair
- go from rf=2 to 3 then rebuild then repair

To tear down a dc, do the inverse

But again, using each and local here is pretty safe - you’re confining your 
reads to where you query and you can do a single rebuild 
 + repair after going to 3 
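
As a sketch only (not a recommendation for Sam's exact cluster), the sequence
above could look like this with the Python driver; the keyspace name "ks", DC
names "dc1"/"dc2", the RF values, and the contact point are placeholders:

# Sketch of the incremental expansion: bump the new DC's RF one step at a
# time, rebuilding and repairing between steps. All names are placeholders.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()   # hypothetical contact point

for new_rf in (1, 2, 3):
    session.execute(
        "ALTER KEYSPACE ks WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': %d}" % new_rf)
    # Then, on every node in dc2:
    #   nodetool rebuild -- dc1      (stream the newly owned replicas from dc1)
    #   nodetool repair ks           (full or incremental, 4.0+)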


> On Nov 25, 2021, at 11:53 AM, Sam Kramer  wrote:
> 
> 
> Hi both, thank you for your responses!
> 
> Yes Jeff, we expect strictly correct responses. Our starting / ending 
> topologies are near-identical (DC1: A/B/C, DC2: A/B/C), and reads are 
> performed at LOCAL_QUORUM, while writes are done at EACH_QUORUM or ALL.
> 
> Thanks,
> Sam
> 
>> On Thu, Nov 25, 2021 at 9:38 AM Jeff Jirsa  wrote:
>> The risk is not negligible if you expect strictly correct responses
>> 
>> The only way to do this correctly is very, very labor intensive at the 
>> moment, and it requires repair between rebuilds and incrementally adding 
>> replicas such that you don’t violate consistency 
>> 
>> If you give me the starting topology, ending topology, and what consistency 
>> level you use for reads and writes I’ll describe the changes you have to do 
>> to do this safely
>> 
>> 
>> 
>>>> On Nov 25, 2021, at 8:49 AM, Erick Ramirez  
>>>> wrote:
>>>> 
>>> 
>>> Yes, you are correct that the source may not necessarily be fully 
>>> consistent. But this risk is negligible if your cluster is sized-correctly 
>>> and nodes are not dropping mutations.
>>> 
>>> If your nodes are dropping mutations because they're overloaded and cannot 
>>> keep up with writes, rebuild is probably the least of your problems. Cheers!


Re: Which source replica does rebuild stream from?

2021-11-25 Thread Jeff Jirsa
The risk is not negligible if you expect strictly correct responses

The only way to do this correctly is very, very labor intensive at the moment, 
and it requires repair between rebuilds and incrementally adding replicas such 
that you don’t violate consistency 

If you give me the starting topology, ending topology, and what consistency 
level you use for reads and writes I’ll describe the changes you have to do to 
do this safely



> On Nov 25, 2021, at 8:49 AM, Erick Ramirez  wrote:
> 
> 
> Yes, you are correct that the source may not necessarily be fully consistent. 
> But this risk is negligible if your cluster is sized-correctly and nodes are 
> not dropping mutations.
> 
> If your nodes are dropping mutations because they're overloaded and cannot 
> keep up with writes, rebuild is probably the least of your problems. Cheers!


Re: Cross DC replication failing

2021-11-13 Thread Jeff Jirsa




> On Nov 13, 2021, at 10:25 AM, Inquistive allen  wrote:
> 
> 
> Hello team,
> Greetings.
> 
> Simple question
> 
> Using Cassandra 3.0.8
> Writing to DC-A using local_quorum
> Reading the same data from a DC-B using local quorum.
> 
> It succeeds for one table and fails for another.
> Data written is not replicated immediately, or even after a long time.
> However if read is performed using a quorum we fetch data.
> 
> I do understand what is happening
> 
> Any idea
> 1. How to track how many messages are yet to be replicated

Potentially pending hints but not really



> 2. Any way to understand why they are not getting replicated across DC

Logs and tracing

> 3. I know I can force a sync by repair but this cross DC replication is 
> expected to be automatic. Am I correct?


Yes
This is probably some form of bug or configuration error. 3.0.8 is quite old 
and has a lot of known bugs, I’d start by upgrading to newest 3.0 or 3.11 or 4.0


Re: One big giant cluster or several smaller ones?

2021-11-12 Thread Jeff Jirsa
Oh sorry - a cluster per application makes sense. Sharding within an
application makes sense to avoid very very very large clusters (think:
~thousand nodes). 1 cluster per app/use case.

On Fri, Nov 12, 2021 at 1:39 PM S G  wrote:

> Thanks Jeff.
> Any side-effect on the client config from small clusters perspective?
>
> Like several smaller clusters means more CassandraClient objects on the
> client side, but I guess the number of connections should remain the same since the
> number of physical nodes will most likely remain the same. So I think the
> client side would not see any major issue.
>
>
> On Fri, Nov 12, 2021 at 11:46 AM Jeff Jirsa  wrote:
>
>> Most people are better served building multiple clusters and spending
>> their engineering time optimizing for maintaining multiple clusters, vs
>> spending their engineering time learning how to work around the sharp edges
>> that make large shared clusters hard.
>>
>> Large multi-tenant clusters give you less waste and a bit more elasticity
>> (one tenant can burst and use spare capacity that would typically be left
>> for the other tenants). However, one bad use case / table can ruin
>> everything (one bad read that generates GC hits all use cases), and
>> eventually certain mechanisms/subsystems dont scale past certain points
>> (e.g. schema - large schemas and large clusters are much harder than small
>> schemas and small clusters)
>>
>>
>>
>>
>> On Fri, Nov 12, 2021 at 11:31 AM S G  wrote:
>>
>>> Hello,
>>>
>>> Is there any case where we would prefer one big giant cluster (with
>>> multiple large tables) over several smaller clusters?
>>> Apart from some management overhead of multiple Cassandra Clients, it
>>> seems several smaller clusters are always better than a big one:
>>>
>>>1. Avoids SPOF for all tables
>>>2. Helps debugging (less noise from all tables in the logs)
>>>3. Traffic spikes on one table do not affect others if they are in
>>>different tables.
>>>4. We can scale tables independently of each other - so colder data
>>>can be in a smaller cluster (more data/node) while hotter data can be on 
>>> a
>>>bigger cluster (less data/node)
>>>
>>>
>>> It does not mean that every table should be in its own cluster.
>>> But large ones can be moved to their own dedicated clusters (like those
>>> more than a few terabytes).
>>> And smaller ones can be clubbed together in one or few clusters.
>>>
>>> Please share any recommendations for the above from actual production
>>> experiences.
>>> Thanks for helping !
>>>
>>>


Re: One big giant cluster or several smaller ones?

2021-11-12 Thread Jeff Jirsa
Most people are better served building multiple clusters and spending their
engineering time optimizing for maintaining multiple clusters, vs spending
their engineering time learning how to work around the sharp edges that
make large shared clusters hard.

Large multi-tenant clusters give you less waste and a bit more elasticity
(one tenant can burst and use spare capacity that would typically be left
for the other tenants). However, one bad use case / table can ruin
everything (one bad read that generates GC hits all use cases), and
eventually certain mechanisms/subsystems dont scale past certain points
(e.g. schema - large schemas and large clusters are much harder than small
schemas and small clusters)




On Fri, Nov 12, 2021 at 11:31 AM S G  wrote:

> Hello,
>
> Is there any case where we would prefer one big giant cluster (with
> multiple large tables) over several smaller clusters?
> Apart from some management overhead of multiple Cassandra Clients, it
> seems several smaller clusters are always better than a big one:
>
>1. Avoids SPOF for all tables
>2. Helps debugging (less noise from all tables in the logs)
>3. Traffic spikes on one table do not affect others if they are in
>different tables.
>4. We can scale tables independently of each other - so colder data
>can be in a smaller cluster (more data/node) while hotter data can be on a
>bigger cluster (less data/node)
>
>
> It does not mean that every table should be in its own cluster.
> But large ones can be moved to their own dedicated clusters (like those
> more than a few terabytes).
> And smaller ones can be clubbed together in one or few clusters.
>
> Please share any recommendations for the above from actual production
> experiences.
> Thanks for helping !
>
>


Re: Cassandra Delete Query Doubt

2021-11-10 Thread Jeff Jirsa
This type of delete - which doesn't supply a user_id, so it's deleting a
range of rows - creates what is known as a range tombstone. It's not tied
to any given cell, as it covers a range of cells, and supersedes/shadows
them when merged (either in the read path or compaction path).
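
To illustrate the distinction, using the table from the question below: the
first statement omits user_id and writes a single range tombstone that shadows
the whole partition, while the second supplies the full primary key and writes
a row tombstone for one row. The USING TIMESTAMP value is simply the write
timestamp given to the tombstone itself; it is not read from any column. The
user_id value and the contact point here are made up:

# Illustration of the two delete shapes against the table below; only the
# contact point and the user_id value (42) are invented.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("game")

# No user_id: one range tombstone shadowing every row in the partition.
session.execute(
    "DELETE FROM tournament USING TIMESTAMP 161692578000 "
    "WHERE tournament_id = 1 AND version_id = 1 AND partition_id = 1")

# Full primary key: a row tombstone for a single row.
session.execute(
    "DELETE FROM tournament "
    "WHERE tournament_id = 1 AND version_id = 1 AND partition_id = 1 "
    "AND user_id = 42")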



On Wed, Nov 10, 2021 at 4:27 AM raman gugnani 
wrote:

> HI Team,
>
>
> I have one table below and want to delete data on this table.
>
>
> DELETE  FROM game.tournament USING TIMESTAMP 161692578000 WHERE
> tournament_id = 1 AND version_id = 1 AND partition_id = 1;
>
>
> Cassandra internally manages the timestamp of each column when some data
> is updated on the same column.
>
>
> My query is: *USING TIMESTAMP 161692578000* picks up the timestamp of
> which column?
>
>
>
> CREATE TABLE game.tournament (
>
> tournament_id bigint,
>
> version_id bigint,
>
> partition_id bigint,
>
> user_id bigint,
>
> created_at timestamp,
>
> rank bigint,
>
> score bigint,
>
> updated_at timestamp,
>
> PRIMARY KEY ((tournament_id, version_id, partition_id), user_id)
>
> ) WITH CLUSTERING ORDER BY (user_id ASC)
>
>
>
>
>
>
>
> --
> Raman Gugnani
>


Re: How does a node decide where each of its vnodes will be replicated to?

2021-11-08 Thread Jeff Jirsa
I think your mental model here is trying to map a different db concept
(like elasticsearch shards) to a distributed hash table that doesn't really
map that way.

There's no such physical thing as a vnode. Vnode, as a concept, is "a single
node runs multiple tokens and owns multiple ranges". Multiple ranges are
the "vnodes". There's not a block of data that is a vnode. There's just
hosts and ranges they own.

On Mon, Nov 8, 2021 at 4:07 PM Tech Id  wrote:

> Thanks Jeff.
>
> One follow-up question please: Each node specifies num_tokens.
> So if there are 4 nodes and each specifies 256 tokens, then it means
> together they are responsible for 1024 vnodes.
> Now, when a fifth node joins and has num_tokens set to 256 as well, then
> does the system have 1024+256 = 1280 vnodes?
>
> Or the number of vnodes remains constant in the system and the nodes just
> divide that according to their num_token's weightage?
> So in the above example, number of vnodes is say constant at 1000
> With 4 nodes each specifying 256 vnodes, every node in reality gets 1000/4
> = 250 vnodes
> With 5 nodes each specifying 256 vnodes, every node gets 1000/5 = 200
> vnodes
>
>
>
> On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa  wrote:
>
>> When a machine starts for the first time, the joining node basically
>> chooses a number of tokens (num_tokens) randomly within the range of the
>> partitioner (for murmur3, -2**63 to 2**63), and then bootstraps to claim
>> them.
>>
>> This is sort of a lie, in newer versions, we try to make it a bit more
>> deterministic (it tries to ensure an even distribution), but if you just
>> think of it as random, it'll make more sense.
>>
>> The only thing that enforces any meaningful order or distribution here is
>> a rack-aware snitch, which will ensure that the RF copies of data land on
>> as many racks as possible (which is where it may skip some tokens, if
>> they're found to be on the same rack)
>>
>>
>> On Mon, Nov 8, 2021 at 3:22 PM Tech Id  wrote:
>>
>>> Thanks Jeff.
>>> I think what you explained below is before and after vnodes introduction.
>>> The vnodes part is clear - how each node holds a small range of tokens
>>> and how each node holds a discontiguous set of vnodes.
>>>
>>>1. What is not clear is how each node decided what vnodes it will
>>>get. If it were contiguous, it would have been easy to understand (like
>>>token range).
>>>2. Also the original question of this thread: If each node does not
>>>replicate all its vnodes to the same 2 nodes (assume RF=2), then how does
>>>it decide where each of its vnode will be replicated to?
>>>
>>> Maybe the answer to #2 is apparent in #1 answer.
>>> But I would really appreciate if someone can help me understand the
>>> above.
>>>
>>>
>>>
>>> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa  wrote:
>>>
>>>> Vnodes are implemented by giving a single process multiple tokens.
>>>>
>>>> Tokens ultimately determine which data lives on which node. When you
>>>> hash a partition key, it gives you a token (let's say 570). The 3 processes
>>>> that own token 570 are the next 3 tokens in the ring ABOVE 570, so if you
>>>> had
>>>> A = 0
>>>> B = 1000
>>>> C = 2000
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> The replicas for data for token=570 are B,C,D
>>>>
>>>>
>>>> When you have vnodes and there's lots of tokens (from the same small
>>>> set of 5 hosts), it'd look closer to:
>>>> A = 0
>>>> C = 100
>>>> A = 300
>>>> B = 700
>>>> D = 800
>>>> B = 1000
>>>> D = 1300
>>>> C = 1700
>>>> B = 1800
>>>> C = 2000
>>>> E = 2100
>>>> B = 2400
>>>> A = 2900
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> In this case, the replicas for token=570 are B, D and C (it would go B,
>>>> D, B, D, but we would de-duplicate the B and D and look for the next
>>>> non-B/non-D host = C at 1700)
>>>>
>>>> If you want to see a view of this in your own cluster, use `nodetool
>>>> ring` to see the full token ring.
>>>>
>>>> There's no desire to enforce a replication mapping where all data on A
>>>> is replicated to the same set of replicas of A, because the point of vnodes
>>>> i

Re: How does a node decide where each of its vnodes will be replicated to?

2021-11-08 Thread Jeff Jirsa
When a machine starts for the first time, the joining node basically
chooses a number of tokens (num_tokens) randomly within the range of the
partitioner (for murmur3, -2**63 to 2**63), and then bootstraps to claim
them.

This is sort of a lie, in newer versions, we try to make it a bit more
deterministic (it tries to ensure an even distribution), but if you just
think of it as random, it'll make more sense.

The only thing that enforces any meaningful order or distribution here is a
rack-aware snitch, which will ensure that the RF copies of data land on as
many racks as possible (which is where it may skip some tokens, if they're
found to be on the same rack)


On Mon, Nov 8, 2021 at 3:22 PM Tech Id  wrote:

> Thanks Jeff.
> I think what you explained below is before and after vnodes introduction.
> The vnodes part is clear - how each node holds a small range of tokens and
> how each node holds a discontiguous set of vnodes.
>
>1. What is not clear is how each node decided what vnodes it will get.
>If it were contiguous, it would have been easy to understand (like token
>range).
>2. Also the original question of this thread: If each node does not
>replicate all its vnodes to the same 2 nodes (assume RF=2), then how does
>it decide where each of its vnode will be replicated to?
>
> Maybe the answer to #2 is apparent in #1 answer.
> But I would really appreciate if someone can help me understand the above.
>
>
>
> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa  wrote:
>
>> Vnodes are implemented by giving a single process multiple tokens.
>>
>> Tokens ultimately determine which data lives on which node. When you hash
>> a partition key, it gives you a token (let's say 570). The 3 processes that
>> own token 570 are the next 3 tokens in the ring ABOVE 570, so if you had
>> A = 0
>> B = 1000
>> C = 2000
>> D = 3000
>> E = 4000
>>
>> The replicas for data for token=570 are B,C,D
>>
>>
>> When you have vnodes and there's lots of tokens (from the same small set
>> of 5 hosts), it'd look closer to:
>> A = 0
>> C = 100
>> A = 300
>> B = 700
>> D = 800
>> B = 1000
>> D = 1300
>> C = 1700
>> B = 1800
>> C = 2000
>> E = 2100
>> B = 2400
>> A = 2900
>> D = 3000
>> E = 4000
>>
>> In this case, the replicas for token=570 are B, D and C (it would go B,
>> D, B, D, but we would de-duplicate the B and D and look for the next
>> non-B/non-D host = C at 1700)
>>
>> If you want to see a view of this in your own cluster, use `nodetool
>> ring` to see the full token ring.
>>
>> There's no desire to enforce a replication mapping where all data on A is
>> replicated to the same set of replicas of A, because the point of vnodes is
>> to give A many distinct replicas so when you replace A, it can replicate
>> from "many" other sources (maybe a dozen, maybe a hundred). This was super
>> important before 4.0, because each replication stream was single threaded
>> by SENDER, so vnodes let you use more than 2-3 cores to re-replicate (in
>> 4.0, it's still single threaded, but we avoid a lot of deserialization so
>> we can saturate a nic with only a few cores, that was much harder to do
>> before).
>>
>>
>> On Mon, Nov 8, 2021 at 1:44 PM Tech Id  wrote:
>>
>>>
>>> Hello,
>>>
>>> Going through
>>> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
>>> .
>>>
>>> But it is not clear how a node decides where each of its vnodes will be
>>> replicated to.
>>>
>>> As an example from the above page:
>>>
>>>1. Why is vnode A present in nodes 1,2 and 5
>>>2. BUT vnode B is present in nodes 1,4 and 6
>>>
>>>
>>> I realize that the diagram is for illustration purposes only, but the
>>> idea being conveyed should nevertheless be the same as I suggested above.
>>>
>>> So how come node 1 decides to put A on itself, 2 and 5 but put B on
>>> itself, 4 and 6 ?
>>> Shouldn't there be consistency here such that all vnodes present on A
>>> are replicated to same set of other nodes?
>>>
>>> Any clarifications on that would be appreciated.
>>>
>>> I also understand that different vnodes are replicated to different
>>> nodes for performance.
>>> But all I want to know is the algorithm that it uses to put them on
>>> different nodes.
>>>
>>> Thanks!
>>>
>>>


Re: How does a node decide where each of its vnodes will be replicated to?

2021-11-08 Thread Jeff Jirsa
Vnodes are implemented by giving a single process multiple tokens.

Tokens ultimately determine which data lives on which node. When you hash a
partition key, it gives you a token (let's say 570). The 3 processes that
own token 570 are the next 3 tokens in the ring ABOVE 570, so if you had
A = 0
B = 1000
C = 2000
D = 3000
E = 4000

The replicas for data for token=570 are B,C,D


When you have vnodes and there's lots of tokens (from the same small set of
5 hosts), it'd look closer to:
A = 0
C = 100
A = 300
B = 700
D = 800
B = 1000
D = 1300
C = 1700
B = 1800
C = 2000
E = 2100
B = 2400
A = 2900
D = 3000
E = 4000

In this case, the replicas for token=570 are B, D and C (it would go B, D,
B, D, but we would de-duplicate the B and D and look for the next
non-B/non-D host = C at 1700)
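
To make that walk concrete, here is a toy sketch (not Cassandra's actual code,
and ignoring racks/DCs) that reproduces the lookup on the example ring above:

# Toy model of the ring walk (SimpleStrategy-style, no rack/DC awareness):
# start at the first token at or above the key's token, walk clockwise, and
# de-duplicate hosts until RF distinct replicas are found.
from bisect import bisect_left

RING = [  # (token, host) pairs from the example above, sorted by token
    (0, "A"), (100, "C"), (300, "A"), (700, "B"), (800, "D"), (1000, "B"),
    (1300, "D"), (1700, "C"), (1800, "B"), (2000, "C"), (2100, "E"),
    (2400, "B"), (2900, "A"), (3000, "D"), (4000, "E"),
]

def replicas(token, rf=3):
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token)            # first token >= key's token
    result = []
    for i in range(len(RING)):
        host = RING[(start + i) % len(RING)][1]   # wrap around the ring
        if host not in result:
            result.append(host)
        if len(result) == rf:
            break
    return result

print(replicas(570))   # ['B', 'D', 'C'], matching the walk-through above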

If you want to see a view of this in your own cluster, use `nodetool ring`
to see the full token ring.

There's no desire to enforce a replication mapping where all data on A is
replicated to the same set of replicas of A, because the point of vnodes is
to give A many distinct replicas so when you replace A, it can replicate
from "many" other sources (maybe a dozen, maybe a hundred). This was super
important before 4.0, because each replication stream was single threaded
by SENDER, so vnodes let you use more than 2-3 cores to re-replicate (in
4.0, it's still single threaded, but we avoid a lot of deserialization so
we can saturate a nic with only a few cores, that was much harder to do
before).


On Mon, Nov 8, 2021 at 1:44 PM Tech Id  wrote:

>
> Hello,
>
> Going through
> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
> .
>
> But it is not clear how a node decides where each of its vnodes will be
> replicated to.
>
> As an example from the above page:
>
>1. Why is vnode A present in nodes 1,2 and 5
>2. BUT vnode B is present in nodes 1,4 and 6
>
>
> I realize that the diagram is for illustration purposes only, but the idea
> being conveyed should nevertheless be the same as I suggested above.
>
> So how come node 1 decides to put A on itself, 2 and 5 but put B on
> itself, 4 and 6 ?
> Shouldn't there be consistency here such that all vnodes present on A are
> replicated to same set of other nodes?
>
> Any clarifications on that would be appreciated.
>
> I also understand that different vnodes are replicated to different nodes
> for performance.
> But all I want to know is the algorithm that it uses to put them on
> different nodes.
>
> Thanks!
>
>


Re: 4.0.1 - adding a node

2021-10-28 Thread Jeff Jirsa
I think you started at 4930 and ended at 5461, difference of 530 (which is
the new host)

If you run `nodetool cleanup` on every other node in the cluster, you
likely drop back down close to 4931 again.



On Thu, Oct 28, 2021 at 12:04 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> I recently added a node to a cluster.  Immediately after adding the
> node, the cluster status (nyx is the new node):
>
> UJ  nyx.querymasters.com181.25 KiB 250 ?
> 07bccfce-45f1-41a3-a5c4-ee748a7a9b98 rack1
> UN  enceladus.querymasters.com  569.53 GiB  200 35.1%
> 660f476c-a124-4ca0-b55f-75efe56370da  rack1
> UN  calypso.querymasters.com578.79 GiB  200 34.8%
> e83aa851-69b4-478f-88f6-60e657ea6539  rack1
> UN  fortuna.querymasters.com593.79 GiB  200 34.6%
> 49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
> UN  charon.querymasters.com 603.3 GiB   200 35.0%
> d9702f96-256e-45ae-8e12-69a42712be50  rack1
> UN  eros.querymasters.com   589.04 GiB  200 34.2%
> 93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
> UN  hercules.querymasters.com   12.31 GiB   4 0.7%
> a1a16910-9167-4174-b34b-eb859d36347e  rack1
> UN  ursula.querymasters.com 611.65 GiB  200 35.0%
> 4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
> UN  gaia.querymasters.com   480.62 GiB  200 34.7%
> b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
> UN  chaos.querymasters.com  358.07 GiB  120 20.5%
> 08a19658-40be-4e55-8709-812b3d4ac750  rack1
> UN  pallas.querymasters.com 537.88 GiB  200 35.3%
> b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
>
> If I add up the Load column, I get 4,632.67GiB.  After overnight:
>
> [joeo@calypso ~]$ nodetool status -r
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address LoadTokens  Owns (effective)
> Host ID   Rack
> UN  nyx.querymasters.com535.16 GiB  250 38.0%
> 07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
> UN  enceladus.querymasters.com  568.87 GiB  200 30.4%
> 660f476c-a124-4ca0-b55f-75efe56370da  rack1
> UN  calypso.querymasters.com578.81 GiB  200 30.4%
> e83aa851-69b4-478f-88f6-60e657ea6539  rack1
> UN  fortuna.querymasters.com593.82 GiB  200 30.4%
> 49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
> UN  charon.querymasters.com 602.38 GiB  200 30.4%
> d9702f96-256e-45ae-8e12-69a42712be50  rack1
> UN  eros.querymasters.com   588.3 GiB   200 30.4%
> 93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
> UN  hercules.querymasters.com   12.31 GiB   4 0.6%
> a1a16910-9167-4174-b34b-eb859d36347e  rack1
> UN  ursula.querymasters.com 610.54 GiB  200 30.3%
> 4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
> UN  gaia.querymasters.com   480.53 GiB  200 30.5%
> b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
> UN  chaos.querymasters.com  358.44 GiB  120 18.2%
> 08a19658-40be-4e55-8709-812b3d4ac750  rack1
> UN  pallas.querymasters.com 537.94 GiB  200 30.4%
> b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
>
> If I add up the Load, I get 5,466.79GiB.  I have added no new data to
> the cluster, yet the Load has increased by 834.  Is this expected behavior?
> Thank you!
>
> -Joe
>
>


Re: Tombstones? 4.0.1

2021-10-25 Thread Jeff Jirsa
This is not the right data model for Cassandra. Strong encouragement to
watch one of Patrick McFadin's data modeling videos on youtube.

You very much want to always query with a WHERE clause, which usually
means knowing a partition key (or set of partition keys) likely to contain
your data, and using sorting within those keys to make data easy to access.
This may mean a partition key that's a date (so all writes for a given date
land on one replica set, and you only query within that set), or
(date,bucket) tuple (where bucket is an int from 0-1000, for example, which
avoids hostspotting), then you can read (date, 7) and (date, 996)  and
everything else in concurrent async queries, or something else that gives
you deterministic partitioning so you're not walking past all of those dead
tombstones.

Reading with a naive SELECT with no WHERE is going to be perhaps the least
performant way to do this in cassandra, and you are probably miserable at
both the response time and the failure rate, because this is not how
Cassandra is designed to work.
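
As a rough sketch of the (date, bucket) idea for a de-duplication style
lookup, using the Python driver: every name here (keyspace "doc", table
"seen_ids", column names), the bucket count, and the way the bucket is derived
from the uuid are illustrative assumptions, not Joe's actual schema:

# Hypothetical (day, bucket) de-duplication table plus a bounded, parallel
# read path. Every name here is made up for illustration.
import datetime
import uuid
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

BUCKETS = 1000
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE TABLE IF NOT EXISTS doc.seen_ids (
        day date, bucket int, id uuid,
        PRIMARY KEY ((day, bucket), id)
    )""")

insert = session.prepare(
    "INSERT INTO doc.seen_ids (day, bucket, id) VALUES (?, ?, ?)")
select = session.prepare(
    "SELECT id FROM doc.seen_ids WHERE day = ? AND bucket = ?")

def bucket_for(an_id):
    return an_id.int % BUCKETS         # derive the bucket from the uuid itself

# Write side: each id lands in a deterministic (day, bucket) partition.
day = datetime.date.today()
new_id = uuid.uuid4()
session.execute(insert, (day, bucket_for(new_id), new_id))

# Read side: fan out one small query per (day, bucket) pair concurrently,
# instead of an unbounded SELECT with no WHERE.
params = [(day, b) for b in range(BUCKETS)]
for success, rows in execute_concurrent_with_args(
        session, select, params, concurrency=50):
    if success:
        for row in rows:
            pass  # row.id has been seen before

The point is that every read is bounded to a single known partition, and the
fan-out across buckets happens client-side in parallel rather than by walking
the whole table (and all of its tombstones).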



On Mon, Oct 25, 2021 at 3:56 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi Jeff - yes, I'm doing a select without where - specifically:  select
> uuid from table limit 1000;
> Not inserting nulls, and nothing is TTL'd.
> At this point with zero rows, the above select fails.
>
> Sounds like my application needs a redesign as doing 1 billion inserts,
> and 100 million deletes results in an unusable table.  I'm using Cassandra
> to de-duplicate data and that's not a good use case for it.
>
> -Joe
> On 10/25/2021 6:51 PM, Jeff Jirsa wrote:
>
> The tombstone threshold is "how many tombstones are encountered within a
> single read command", and the default is something like 100,000 (
> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1293-L1294
> )
>
> Deletes are not forbidden, but you have to read in such a way that you
> touch less than 100,000 deletes per read.
>
> Are you doing full table scans or SELECT without WHERE?
> Are you inserting nulls in some columns?
> Are you TTL'ing everything ?
>
>
>
> On Mon, Oct 25, 2021 at 3:28 PM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Update - after 10 days, I'm able to use the table again; prior to that
>> all selects timed out.
>> Are deletes basically forbidden with Cassandra?  If you have a table
>> where you want to do lots of inserts and deletes, is there an option that
>> works in Cassandra?  Even though the table now has zero rows, after
>> deleting them, I can no longer do a select from the table as it times out.
>> Thank you!
>>
>> -Joe
>> On 10/14/2021 3:38 PM, Joe Obernberger wrote:
>>
>> I'm not sure if tombstones is the issue; is it?  Grace is set to 10 days,
>> that time has not passed yet.
>>
>> -Joe
>> On 10/14/2021 1:37 PM, James Brown wrote:
>>
>> What is gc_grace_seconds set to on the table? Once that passes, you can
>> do `nodetool scrub` to more emphatically remove tombstones...
>>
>> On Thu, Oct 14, 2021 at 8:49 AM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> Hi all - I have a table where I've needed to delete a number of rows.
>>> I've run repair, but I still can't select from the table.
>>>
>>> select * from doc.indexorganize limit 10;
>>> OperationTimedOut: errors={'172.16.100.37:9042': 'Client request
>>> timeout. See Session.execute[_async](timeout)'},
>>> last_host=172.16.100.37:9042
>>>
>>> Info on the table:
>>>
>>> nodetool tablestats doc.indexorganize
>>> Total number of tables: 97
>>> 
>>> Keyspace : doc
>>>  Read Count: 170275408
>>>  Read Latency: 1.6486837044783356 ms
>>>  Write Count: 6821769404
>>>  Write Latency: 0.08147347268570909 ms
>>>  Pending Flushes: 0
>>>  Table: indexorganize
>>>  SSTable count: 21
>>>  Old SSTable count: 0
>>>  Space used (live): 1536557040
>>>  Space used (total): 1536557040
>>>  Space used by snapshots (total): 1728378992
>>>  Off heap memory used (total): 46251932
>>>  SSTable Compression Ratio: 0.5218383898575761
>>>  Number of partitions (estimate): 17365415
>>>  Memtable cell count: 0
>>>  Memtable data size: 0
>>> 

Re: Tombstones? 4.0.1

2021-10-25 Thread Jeff Jirsa
The tombstone threshold is "how many tombstones are encountered within a
single read command", and the default is something like 100,000 (
https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1293-L1294
)

Deletes are not forbidden, but you have to read in such a way that you
touch less than 100,000 deletes per read.

Are you doing full table scans or SELECT without WHERE?
Are you inserting nulls in some columns?
Are you TTL'ing everything ?



On Mon, Oct 25, 2021 at 3:28 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Update - after 10 days, I'm able to use the table again; prior to that all
> selects timed out.
> Are deletes basically forbidden with Cassandra?  If you have a table where
> you want to do lots of inserts and deletes, is there an option that works
> in Cassandra?  Even thought the table now has zero rows, after deleting
> them, I can no longer do a select from the table as it times out.
> Thank you!
>
> -Joe
> On 10/14/2021 3:38 PM, Joe Obernberger wrote:
>
> I'm not sure if tombstones is the issue; is it?  Grace is set to 10 days,
> that time has not passed yet.
>
> -Joe
> On 10/14/2021 1:37 PM, James Brown wrote:
>
> What is gc_grace_seconds set to on the table? Once that passes, you can do
> `nodetool scrub` to more emphatically remove tombstones...
>
> On Thu, Oct 14, 2021 at 8:49 AM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Hi all - I have a table where I've needed to delete a number of rows.
>> I've run repair, but I still can't select from the table.
>>
>> select * from doc.indexorganize limit 10;
>> OperationTimedOut: errors={'172.16.100.37:9042': 'Client request
>> timeout. See Session.execute[_async](timeout)'},
>> last_host=172.16.100.37:9042
>>
>> Info on the table:
>>
>> nodetool tablestats doc.indexorganize
>> Total number of tables: 97
>> 
>> Keyspace : doc
>>  Read Count: 170275408
>>  Read Latency: 1.6486837044783356 ms
>>  Write Count: 6821769404
>>  Write Latency: 0.08147347268570909 ms
>>  Pending Flushes: 0
>>  Table: indexorganize
>>  SSTable count: 21
>>  Old SSTable count: 0
>>  Space used (live): 1536557040
>>  Space used (total): 1536557040
>>  Space used by snapshots (total): 1728378992
>>  Off heap memory used (total): 46251932
>>  SSTable Compression Ratio: 0.5218383898575761
>>  Number of partitions (estimate): 17365415
>>  Memtable cell count: 0
>>  Memtable data size: 0
>>  Memtable off heap memory used: 0
>>  Memtable switch count: 12
>>  Local read count: 17346304
>>  Local read latency: NaN ms
>>  Local write count: 31340451
>>  Local write latency: NaN ms
>>  Pending flushes: 0
>>  Percent repaired: 100.0
>>  Bytes repaired: 1.084GiB
>>  Bytes unrepaired: 0.000KiB
>>  Bytes pending repair: 0.000KiB
>>  Bloom filter false positives: 0
>>  Bloom filter false ratio: 0.0
>>  Bloom filter space used: 38030728
>>  Bloom filter off heap memory used: 38030560
>>  Index summary off heap memory used: 7653060
>>  Compression metadata off heap memory used: 568312
>>  Compacted partition minimum bytes: 51
>>  Compacted partition maximum bytes: 86
>>  Compacted partition mean bytes: 67
>>  Average live cells per slice (last five minutes):
>> 73.53164556962025
>>  Maximum live cells per slice (last five minutes): 5722
>>  Average tombstones per slice (last five minutes): 1.0
>>  Maximum tombstones per slice (last five minutes): 1
>>  Dropped Mutations: 0
>>
>> nodetool tablehistograms doc.indexorganize
>> doc/indexorganize histograms
>> Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
>>                 (micros)       (micros)                   (bytes)
>> 50%                 0.00           0.00      0.00              60           1
>> 75%                 0.00           0.00      0.00              86           2
>> 95%                 0.00           0.00      0.00              86           2
>> 98%                 0.00           0.00      0.00              86           2
>> 99%                 0.00           0.00      0.00              86           2
>> Min                 0.00           0.00      0.00              51           0
>> Max                 0.00           0.00      0.00              86           2
>>
>> Any ideas on what I can do?  Thank you!
>>
>> -Joe
>>
>>
>
> --
> James Brown
> Engineer
>
>
> 

Re: How to find traffic profile per client on a Cassandra server?

2021-10-24 Thread Jeff Jirsa
Table level metrics ? 
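
For example (keyspace/table names below are placeholders), per-table read
counts and hot partitions can be sampled on each node:

```
# per-table read/write counts and latencies on this node
nodetool tablestats my_keyspace.my_table

# sample the hottest partitions for ~10 seconds (3.11 syntax)
nodetool toppartitions my_keyspace my_table 10000

# on 4.0+, per-connection client details
nodetool clientstats --all
```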

> On Oct 24, 2021, at 8:54 PM, S G  wrote:
> 
> 
> Hello,
> 
> We recently faced an issue where the read traffic on a big Cassandra 
> cluster shot up several times (think more than 20 times).
> 
> However, the client team denies sending any huge load and they have their own 
> traffic graphs to prove the same.
> 
> Assuming the client team's graphs are correct, how do we know the source of 
> traffic ? Slow query logging is enabled, but it only logs queries after a 
> certain threshold, so not very helpful.
> 
> Secondly, we do not know when the incident will recur. So how do we solve 
> such a problem and put some monitoring in place that shows the source of such 
> huge spikes when it happens next time.
> 
> Thinking of trying lsof -i and netstat -tn commands in a per-minute cron on 
> each server but they only show connections from clients, not how many 
> requests in those connections.
> Any suggestions on how to go about this?
> 
> Thanks !
> 


Re: Single node slowing down queries in a large cluster

2021-10-17 Thread Jeff Jirsa
Internode speculative retry is on by default with p99

The client side retry varies by driver / client 
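
As a sketch, assuming the DataStax Java driver 4.x (key names are from its
reference.conf; the numbers are placeholders to tune, and speculation only
applies to statements marked idempotent):

```
# application.conf
datastax-java-driver.advanced.speculative-execution-policy {
  class = ConstantSpeculativeExecutionPolicy
  max-executions = 2        # original attempt + 1 speculative attempt
  delay = 100 milliseconds  # wait this long before trying the next node
}
```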

> On Oct 17, 2021, at 1:59 PM, S G  wrote:
> 
> 
> 
> "The harder thing to solve is a bad coordinator node slowing down all reads 
> coordinated by that node"
> I think this is the root of the problem. Since all nodes act as coordinator 
> nodes, it is guaranteed that if any 1 node slows down (High GC, Segment 
> Merging, etc.), it will slow down 1/N queries in the cluster (N = ring 
> size).
> 
> Speculative retry seems like a good option (non-percentile based) if it also 
> mandates the selection of a different server in the retry.
> 
> Is any kind of speculative retry turned on by default ?
> 
> 
> 
>> On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa  wrote:
>> Some random notes, not necessarily going to help you, but:
>> - You probably have vnodes enabled, which means one bad node is PROBABLY a 
>> replica of almost every other node, so the fanout here is worse than it 
>> should be, and
>> - You probably have speculative retry on the table set to a percentile. As 
>> the host gets slow, the percentiles change, and speculative retry stops being 
>> useful, so you end up timing out queries
>> 
>> If you change speculative retry to use the MIN(Xms, p99) syntax, with X set 
>> on your real workload, you can likely force it to speculate sooner when that 
>> one host gets sick.
>> 
>> The harder thing to solve is a bad coordinator node slowing down all reads 
>> coordinated by that node. Retry at the client level to work around that 
>> tends to be effective.
>> 
>> 
>> 
>>> On Wed, Oct 13, 2021 at 2:22 PM S G  wrote:
>>> Hello,
>>> 
>>> We have frequently seen that a single bad node running slow can affect the 
>>> latencies of the entire cluster (especially for queries where the slow node 
>>> was acting as a coordinator).
>>> 
>>> Is there any suggestion to avoid this behavior?
>>> Like something on the client side to not query that bad node or something 
>>> on the bad node that redirects its query to other healthy coordinators?
>>> 
>>> Thanks,
>>> 


Re: Schema collision results in multiple data directories per table

2021-10-15 Thread Jeff Jirsa
Consistency doesn't matter for schema.

For every host: "select id from system_schema.tables WHERE keyspace_name=?
and table_name=?" (
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/schema/SchemaKeyspace.java#L144
)

Then, compare that to the /path/to/data/keyspace/table-(id)/ on disk

If any of those don't match, you've got a problem waiting to bite you on
next restart.
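
A rough shell sketch of that check (host list, data path, keyspace, and table
name are placeholders; the on-disk directory is the table name plus the id
with the dashes stripped):

```
ks=my_keyspace; tbl=my_table
for host in node1 node2 node3; do
  id=$(cqlsh "$host" --no-color -e \
      "SELECT id FROM system_schema.tables WHERE keyspace_name='$ks' AND table_name='$tbl';" \
      | grep -Eo '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | tr -d '-')
  echo "$host schema id: $id"
  ssh "$host" "ls -d /var/lib/cassandra/data/$ks/$tbl-*"
done
```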



On Fri, Oct 15, 2021 at 3:48 PM Tom Offermann 
wrote:

> So, if I were to do `CONSISTENCY ALL; select *` from each of the
> system_schema tables, then on-disk and in-memory should be in sync?
>
> On Fri, Oct 15, 2021 at 3:38 PM Jeff Jirsa  wrote:
>
>> Heap dumps + filesystem inspection + SELECT from schema tables.
>>
>>
>> On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann 
>> wrote:
>>
>>> Interesting!
>>>
>>> Is there a way to determine if the on-disk schema and the in-memory
>>> schema are in sync? Is there a way to force them to sync? If so, would it
>>> help to force a sync before running an `ALTER KEYSPACE` schema change?
>>>
>>> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa  wrote:
>>>
>>>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID,
>>>> that usually happens during a CREATE TABLE. With no other evidence or
>>>> ability to debug, I would guess that the CFIDs diverged previously, but due
>>>> to the race(s) I described, the on-disk schema and the in-memory schema
>>>> differed, and the ALTER KEYSPACE forces the schema from one host to be
>>>> serialized and forced to the others, where the actual IDs get reconciled.
>>>>
>>>> You may be able to confirm/demonstrate that by looking at the
>>>> timestamps on the data directories across all of the hosts in the cluster?
>>>>
>>>>
>>>>
>>>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann 
>>>> wrote:
>>>>
>>>>> Jeff,
>>>>>
>>>>> Thanks for describing the race condition.
>>>>>
>>>>> I understand that performing concurrent schema changes is dangerous,
>>>>> and that running an `ALTER KEYSPACE` on one node, and then running another
>>>>> `ALTER KEYSPACE` on a different node, before the first has fully 
>>>>> propagated
>>>>> throughout the cluster, can lead to schema collisions.
>>>>>
>>>>> But, can running a single `ALTER KEYSPACE` on a single node also be
>>>>> vulnerable to this race condition?
>>>>>
>>>>> We were careful to make sure that all nodes in both datacenters were
>>>>> on the same schema version ID by checking the output of `nodetool
>>>>> describecluster`. Since all nodes were in agreement, I figured that I had
>>>>> ruled out the possibility of concurrent schema changes.
>>>>>
>>>>> As I mentioned, on the day before, we did run 3 different `ALTER
>>>>> KEYSPACE` schema changes (to add 'dc2' to system_traces,
>>>>> system_distributed, and system_auth) and also ran `nodetool rebuild` for
>>>>> each of the 3 keyspaces. Is it possible that one or more of these schema
>>>>> changes hadn't fully propagated 24 hours later, even though `nodetool
>>>>> describecluster` showed all nodes as being on the same schema version? Is
>>>>> there a better way to determine that I am not inadvertently issuing
>>>>> concurrent schema changes?
>>>>>
>>>>> I'm also curious about how CFIDs are generated and when new ones are
>>>>> generated. What I've noticed is that when I successfully run `ALTER
>>>>> KEYSPACE` to add a datacenter with no errors (and make no other schema
>>>>> changes), then the table IDs in `system_schema.tables` remain unchanged.
>>>>> But, when we saw the schema collision that I described in this thread, 
>>>>> that
>>>>> resulted in new table IDs in `system_schema.tables`. Why do these table 
>>>>> IDs
>>>>> normally remain unchanged? What caused new ones to be generated in the
>>>>> error case I described?
>>>>>
>>>>> --Tom
>>>>>
>>>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa  wrote:
>>>>>
>>>>>> I've described this race a few times on the list. It is very very
>>>>>> dangerous to do concurrent table creation in cassandra wit

Re: Schema collision results in multiple data directories per table

2021-10-15 Thread Jeff Jirsa
Heap dumps + filesystem inspection + SELECT from schema tables.


On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann 
wrote:

> Interesting!
>
> Is there a way to determine if the on-disk schema and the in-memory schema
> are in sync? Is there a way to force them to sync? If so, would it help to
> force a sync before running an `ALTER KEYSPACE` schema change?
>
> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa  wrote:
>
>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID, that
>> usually happens during a CREATE TABLE. With no other evidence or ability to
>> debug, I would guess that the CFIDs diverged previously, but due to the
>> race(s) I described, the on-disk schema and the in-memory schema differed,
>> and the ALTER KEYSPACE forces the schema from one host to be serialized and
>> forced to the others, where the actual IDs get reconciled.
>>
>> You may be able to confirm/demonstrate that by looking at the timestamps
>> on the data directories across all of the hosts in the cluster?
>>
>>
>>
>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann 
>> wrote:
>>
>>> Jeff,
>>>
>>> Thanks for describing the race condition.
>>>
>>> I understand that performing concurrent schema changes is dangerous, and
>>> that running an `ALTER KEYSPACE` on one node, and then running another
>>> `ALTER KEYSPACE` on a different node, before the first has fully propagated
>>> throughout the cluster, can lead to schema collisions.
>>>
>>> But, can running a single `ALTER KEYSPACE` on a single node also be
>>> vulnerable to this race condition?
>>>
>>> We were careful to make sure that all nodes in both datacenters were on
>>> the same schema version ID by checking the output of `nodetool
>>> describecluster`. Since all nodes were in agreement, I figured that I had
>>> ruled out the possibility of concurrent schema changes.
>>>
>>> As I mentioned, on the day before, we did run 3 different `ALTER
>>> KEYSPACE` schema changes (to add 'dc2' to system_traces,
>>> system_distributed, and system_auth) and also ran `nodetool rebuild` for
>>> each of the 3 keyspaces. Is it possible that one or more of these schema
>>> changes hadn't fully propagated 24 hours later, even though `nodetool
>>> describecluster` showed all nodes as being on the same schema version? Is
>>> there a better way to determine that I am not inadvertently issuing
>>> concurrent schema changes?
>>>
>>> I'm also curious about how CFIDs are generated and when new ones are
>>> generated. What I've noticed is that when I successfully run `ALTER
>>> KEYSPACE` to add a datacenter with no errors (and make no other schema
>>> changes), then the table IDs in `system_schema.tables` remain unchanged.
>>> But, when we saw the schema collision that I described in this thread, that
>>> resulted in new table IDs in `system_schema.tables`. Why do these table IDs
>>> normally remain unchanged? What caused new ones to be generated in the
>>> error case I described?
>>>
>>> --Tom
>>>
>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa  wrote:
>>>
>>>> I've described this race a few times on the list. It is very very
>>>> dangerous to do concurrent table creation in cassandra with
>>>> non-deterministic CFIDs.
>>>>
>>>> I'll try to describe it quickly right now:
>>>> - Imagine you have 3 hosts, A B and C
>>>>
>>>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>>>> A allocates a CFID (which is a UUID, which includes a high resolution
>>>> timestamp), starts adjusting its schema
>>>> Before it can finish that schema, you connect to B and issue the same
>>>> CREATE TABLE statement
>>>> B allocates a DIFFERENT CFID, and starts adjusting its schema
>>>>
>>>> A and B both have a CFID, which they will use to make a data directory
>>>> on disk, and which they will push/pull to the rest of the cluster through
>>>> schema propagation.
>>>>
>>>> The later CFID will be saved in the schema, because the schema is a
>>>> normal cassandra table with last-write-wins semantics, but the first CFID
>>>> might be the one that's used to create the data directory on disk, and it
>>>> may have all of your data in it while you write to the table.
>>>>
>>>> In some cases, you'll get CFID mismatch e

Re: Schema collision results in multiple data directories per table

2021-10-15 Thread Jeff Jirsa
I would not expect an ALTER KEYSPACE to introduce a divergent CFID, that
usually happens during a CREATE TABLE. With no other evidence or ability to
debug, I would guess that the CFIDs diverged previously, but due to the
race(s) I described, the on-disk schema and the in-memory schema differed,
and the ALTER KEYSPACE forces the schema from one host to be serialized and
forced to the others, where the actual IDs get reconciled.

You may be able to confirm/demonstrate that by looking at the timestamps on
the data directories across all of the hosts in the cluster?



On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann 
wrote:

> Jeff,
>
> Thanks for describing the race condition.
>
> I understand that performing concurrent schema changes is dangerous, and
> that running an `ALTER KEYSPACE` on one node, and then running another
> `ALTER KEYSPACE` on a different node, before the first has fully propagated
> throughout the cluster, can lead to schema collisions.
>
> But, can running a single `ALTER KEYSPACE` on a single node also be
> vulnerable to this race condition?
>
> We were careful to make sure that all nodes in both datacenters were on
> the same schema version ID by checking the output of `nodetool
> describecluster`. Since all nodes were in agreement, I figured that I had
> ruled out the possibility of concurrent schema changes.
>
> As I mentioned, on the day before, we did run 3 different `ALTER KEYSPACE`
> schema changes (to add 'dc2' to system_traces, system_distributed, and
> system_auth) and also ran `nodetool rebuild` for each of the 3 keyspaces.
> Is it possible that one or more of these schema changes hadn't fully
> propagated 24 hours later, even though `nodetool describecluster` showed
> all nodes as being on the same schema version? Is there a better way to
> determine that I am not inadvertently issuing concurrent schema changes?
>
> I'm also curious about how CFIDs are generated and when new ones are
> generated. What I've noticed is that when I successfully run `ALTER
> KEYSPACE` to add a datacenter with no errors (and make no other schema
> changes), then the table IDs in `system_schema.tables` remain unchanged.
> But, when we saw the schema collision that I described in this thread, that
> resulted in new table IDs in `system_schema.tables`. Why do these table IDs
> normally remain unchanged? What caused new ones to be generated in the
> error case I described?
>
> --Tom
>
> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa  wrote:
>
>> I've described this race a few times on the list. It is very very
>> dangerous to do concurrent table creation in cassandra with
>> non-deterministic CFIDs.
>>
>> I'll try to describe it quickly right now:
>> - Imagine you have 3 hosts, A B and C
>>
>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>> A allocates a CFID (which is a UUID, which includes a high resolution
>> timestamp), starts adjusting its schema
>> Before it can finish that schema, you connect to B and issue the same
>> CREATE TABLE statement
>> B allocates a DIFFERENT CFID, and starts adjusting its schema
>>
>> A and B both have a CFID, which they will use to make a data directory on
>> disk, and which they will push/pull to the rest of the cluster through
>> schema propagation.
>>
>> The later CFID will be saved in the schema, because the schema is a
>> normal cassandra table with last-write-wins semantics, but the first CFID
>> might be the one that's used to create the data directory on disk, and it
>> may have all of your data in it while you write to the table.
>>
>> In some cases, you'll get CFID mismatch errors on reads or writes, as the
>> CFID in memory varies between instances.
>> In other cases, things work fine until you restart, at which time the
>> CFID for the table changes when you load the new schema, and data on disk
>> isn't found.
>>
>> This race, unfortunately, can even occur on a single node in SOME
>> versions of Cassandra (but not all)
>>
>> This is a really really really bad race in many old versions of
>> cassandra, and a lot of the schema redesign in 4.0 is meant to solve many
>> of these types of problems.
>>
>> That this continues to be possible in old versions is scary, people
>> running old versions should not do concurrent schema changes (especially
>> those that CREATE tables). Alternatively, you should alert if the CFID in
>> memory doesn't match the CFID in the disk path. One could also change
>> cassandra to use deterministic CFIDs  to avoid the race entirely (though
>> deterministic CFIDs have a different problem, w

Re: Single node slowing down queries in a large cluster

2021-10-13 Thread Jeff Jirsa
Some random notes, not necessarily going to help you, but:
- You probably have vnodes enabled, which means one bad node is PROBABLY a
replica of almost every other node, so the fanout here is worse than it
should be, and
- You probably have speculative retry on the table set to a percentile. As
the host gets slow, the percentiles change, and speculative retry stops
being useful, so you end up timing out queries

If you change speculative retry to use the MIN(Xms, p99) syntax, with X set
on your real workload, you can likely force it to speculate sooner when
that one host gets sick.
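
For example, on 4.0 (where the MIN()/MAX() syntax exists; the 50ms figure is a
placeholder to set from your real workload, and the table name is hypothetical):

```
ALTER TABLE my_keyspace.my_table
    WITH speculative_retry = 'MIN(50ms,p99)';
```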

The harder thing to solve is a bad coordinator node slowing down all reads
coordinated by that node. Retry at the client level to work around that
tends to be effective.



On Wed, Oct 13, 2021 at 2:22 PM S G  wrote:

> Hello,
>
> We have frequently seen that a single bad node running slow can affect the
> latencies of the entire cluster (especially for queries where the slow node
> was acting as a coordinator).
>
>
> Is there any suggestion to avoid this behavior?
>
> Like something on the client side to not query that bad node or something
> on the bad node that redirects its query to other healthy coordinators?
>
>
> Thanks,
>
>
>


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-13 Thread Jeff Jirsa
The convention in the yaml is that defaults are shown commented out.


On Wed, Oct 13, 2021 at 2:17 PM S G  wrote:

> ok, the link given has the value commented, so I was a bit confused.
> But then https://github.com/apache/cassandra/search?q=cross_node_timeout
> shows that default value is indeed true.
> Thanks for the help,
>
> On Wed, Oct 13, 2021 at 11:26 AM Jeff Jirsa  wrote:
>
>> The default is true:
>>
>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1000
>>
>> There is no equivalent to `alter system kill session`, because it is
>> assumed that any query has a short, finite life in the order of seconds.
>>
>>
>>
>> On Wed, Oct 13, 2021 at 11:10 AM S G  wrote:
>>
>>> Hello,
>>>
>>> Does anyone know about the default being turned off for this setting?
>>> It seems like a good one to be turned on - why have replicas process
>>> something for which the coordinator has already sent a timeout to the client?
>>>
>>> Thanks
>>>
>>> On Tue, Oct 12, 2021 at 11:06 AM S G  wrote:
>>>
>>>> Thanks Bowen.
>>>> Any idea why cross_node_timeout is commented out by default? That seems
>>>> like a good option to enable even as per the documentation:
>>>> # If disabled, replicas will assume that requests
>>>> # were forwarded to them instantly by the coordinator, which means that
>>>> # under overload conditions we will waste that much extra time
>>>> processing
>>>> # already-timed-out requests.
>>>>
>>>> Also, taking an example from Oracle kind of RDBMS systems, is there a
>>>> command like the following that can be fired from an external script to
>>>> kill a long running query on each node:
>>>>
>>>> alter system kill session
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 12, 2021 at 10:49 AM Bowen Song  wrote:
>>>>
>>>>> That will depend on whether you have cross_node_timeout enabled.
>>>>> However, I have to point out that setting the timeout to 15ms is perhaps
>>>>> not a good idea; the JVM GC can easily cause a lot of timeouts.
>>>>> On 12/10/2021 18:20, S G wrote:
>>>>>
>>>>> ok, when a coordinator node sends timeout to the client, does it mean
>>>>> all the replica nodes have stopped processing that specific query too?
>>>>> Or is it just the coordinator node that has stopped waiting for the
>>>>> replicas to return response?
>>>>>
>>>>> On Tue, Oct 12, 2021 at 10:12 AM Jeff Jirsa  wrote:
>>>>>
>>>>>> It sends an exception to the client; it doesn't sever the connection.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 12, 2021 at 10:06 AM S G 
>>>>>> wrote:
>>>>>>
>>>>>>> Do the timeout values only kill the connection with the client or
>>>>>>> send error to the client?
>>>>>>> Or do they also kill the corresponding query execution happening on
>>>>>>> the Cassandra servers (co-ordinator, replicas etc) ?
>>>>>>>
>>>>>>> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The read and write timeout values do this today.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 12, 2021 at 9:53 AM S G 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Is there a way to stop long running queries in Cassandra (versions
>>>>>>>>> 3.11.x or 4.x) ?
>>>>>>>>> The use-case is to have some kind of a circuit breaker based on
>>>>>>>>> query-time that has exceeded the client's SLAs.
>>>>>>>>> Example: If server response is useless to the client after 10 ms,
>>>>>>>>> then we could
>>>>>>>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms
>>>>>>>>> allows for some buffer).
>>>>>>>>> And when that much time has elapsed, Cassandra will kill the query
>>>>>>>>> execution automatically.
>>>>>>>>>
>>>>>>>>> If this is not possible in Cassandra currently, any chance we can
>>>>>>>>> do it outside of Cassandra, like
>>>>>>>>> a shell script that monitors such long running queries (through
>>>>>>>>> users table etc) and kills the
>>>>>>>>> OS-thread responsible for that query (Looks unsafe though as that
>>>>>>>>> might leave the DB in an inconsistent state) ?
>>>>>>>>>
>>>>>>>>> We are trying this as a proactive measure to safeguard our
>>>>>>>>> clusters from any rogue queries fired accidentally or maliciously.
>>>>>>>>>
>>>>>>>>> Thanks !
>>>>>>>>>
>>>>>>>>>


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-13 Thread Jeff Jirsa
The default is true:

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1000

There is no equivalent to `alter system kill session`, because it is
assumed that any query has a short, finite life in the order of seconds.
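
For versions where it ships commented out, enabling it is a one-line change
(clocks must be NTP-synchronized across nodes for it to be meaningful):

```
# cassandra.yaml
cross_node_timeout: true
```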



On Wed, Oct 13, 2021 at 11:10 AM S G  wrote:

> Hello,
>
> Does anyone know about the default being turned off for this setting?
> It seems like a good one to be turned on - why have replicas process
> something for which the coordinator has already sent a timeout to the client?
>
> Thanks
>
> On Tue, Oct 12, 2021 at 11:06 AM S G  wrote:
>
>> Thanks Bowen.
>> Any idea why cross_node_timeout is commented out by default? That seems
>> like a good option to enable even as per the documentation:
>> # If disabled, replicas will assume that requests
>> # were forwarded to them instantly by the coordinator, which means that
>> # under overload conditions we will waste that much extra time processing
>> # already-timed-out requests.
>>
>> Also, taking an example from Oracle kind of RDBMS systems, is there a
>> command like the following that can be fired from an external script to
>> kill a long running query on each node:
>>
>> alter system kill session
>>
>>
>>
>>
>> On Tue, Oct 12, 2021 at 10:49 AM Bowen Song  wrote:
>>
>>> That will depend on whether you have cross_node_timeout enabled.
>>> However, I have to point out that setting the timeout to 15ms is perhaps not a
>>> good idea; the JVM GC can easily cause a lot of timeouts.
>>> On 12/10/2021 18:20, S G wrote:
>>>
>>> ok, when a coordinator node sends timeout to the client, does it mean
>>> all the replica nodes have stopped processing that specific query too?
>>> Or is it just the coordinator node that has stopped waiting for the
>>> replicas to return response?
>>>
>>> On Tue, Oct 12, 2021 at 10:12 AM Jeff Jirsa  wrote:
>>>
>>>> It sends an exception to the client; it doesn't sever the connection.
>>>>
>>>>
>>>> On Tue, Oct 12, 2021 at 10:06 AM S G  wrote:
>>>>
>>>>> Do the timeout values only kill the connection with the client or send
>>>>> error to the client?
>>>>> Or do they also kill the corresponding query execution happening on
>>>>> the Cassandra servers (co-ordinator, replicas etc) ?
>>>>>
>>>>> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:
>>>>>
>>>>>> The read and write timeout values do this today.
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 12, 2021 at 9:53 AM S G 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Is there a way to stop long running queries in Cassandra (versions
>>>>>>> 3.11.x or 4.x) ?
>>>>>>> The use-case is to have some kind of a circuit breaker based on
>>>>>>> query-time that has exceeded the client's SLAs.
>>>>>>> Example: If server response is useless to the client after 10 ms,
>>>>>>> then we could
>>>>>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms
>>>>>>> allows for some buffer).
>>>>>>> And when that much time has elapsed, Cassandra will kill the query
>>>>>>> execution automatically.
>>>>>>>
>>>>>>> If this is not possible in Cassandra currently, any chance we can do
>>>>>>> it outside of Cassandra, like
>>>>>>> a shell script that monitors such long running queries (through
>>>>>>> users table etc) and kills the
>>>>>>> OS-thread responsible for that query (Looks unsafe though as that
>>>>>>> might leave the DB in an inconsistent state) ?
>>>>>>>
>>>>>>> We are trying this as a proactive measure to safeguard our clusters
>>>>>>> from any rogue queries fired accidentally or maliciously.
>>>>>>>
>>>>>>> Thanks !
>>>>>>>
>>>>>>>


Re: Schema collision results in multiple data directories per table

2021-10-13 Thread Jeff Jirsa
I've described this race a few times on the list. It is very very dangerous
to do concurrent table creation in cassandra with non-deterministic CFIDs.

I'll try to describe it quickly right now:
- Imagine you have 3 hosts, A B and C

You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
A allocates a CFID (which is a UUID, which includes a high resolution
timestamp), starts adjusting its schema
Before it can finish that schema, you connect to B and issue the same
CREATE TABLE statement
B allocates a DIFFERENT CFID, and starts adjusting its schema

A and B both have a CFID, which they will use to make a data directory on
disk, and which they will push/pull to the rest of the cluster through
schema propagation.

The later CFID will be saved in the schema, because the schema is a normal
cassandra table with last-write-wins semantics, but the first CFID might be
the one that's used to create the data directory on disk, and it may have
all of your data in it while you write to the table.

In some cases, you'll get CFID mismatch errors on reads or writes, as the
CFID in memory varies between instances.
In other cases, things work fine until you restart, at which time the CFID
for the table changes when you load the new schema, and data on disk isn't
found.

This race, unfortunately, can even occur on a single node in SOME versions
of Cassandra (but not all)

This is a really really really bad race in many old versions of cassandra,
and a lot of the schema redesign in 4.0 is meant to solve many of these
types of problems.

That this continues to be possible in old versions is scary, people running
old versions should not do concurrent schema changes (especially those that
CREATE tables). Alternatively, you should alert if the CFID in memory
doesn't match the CFID in the disk path. One could also change cassandra to
use deterministic CFIDs  to avoid the race entirely (though deterministic
CFIDs have a different problem, which is that DROP + re-CREATE with any
host down potentially allows data on that down host to come back when the
host comes back online).

Stronger cluster metadata starts making this much safer, so looking forward
to seeing that in future releases.



On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis 
wrote:

> You ran the `alter keyspace` command on the original dc1 nodes or the new
> dc2 nodes?
>
> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <
> stefan.mikloso...@instaclustr.com> wrote:
>
>> Hi Tom,
>>
>> while I am not completely sure what might cause your issue, I just
>> want to highlight that schema agreements were overhauled in 4.0 (1) a
>> lot so that may be somehow related to what that ticket was trying to
>> fix.
>>
>> Regards
>>
>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
>>
>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann 
>> wrote:
>> >
>> > When adding a datacenter to a keyspace (following the Last Pickle [Data
>> Center Switch][lp] playbook), I ran into a "Configuration exception merging
>> remote schema" error. The nodes in one datacenter didn't converge to the
>> new schema version, and after restarting them, I saw the symptoms described
>> in this Datastax article on [Fixing a table schema collision][ds], where
>> there were two data directories for each table in the keyspace on the nodes
>> that didn't converge. I followed the recovery steps in the Datastax article
>> to move the data from the older directories to the new directories, ran
>> `nodetool refresh`, and that fixed the problem.
>> >
>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>> > [ds]:
>> https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>> >
>> > While the Datastax article was super helpful for helping me recover,
>> I'm left wondering *why* this happened. If anyone can shed some light on
>> that, or offer advice on how I can avoid getting in this situation in the
>> future, I would be most appreciative. I'll describe the steps I took in
>> more detail in the thread.
>> >
>> > ## Steps
>> >
>> > 1. The day before, I had added the second datacenter ('dc2') to the
>> system_traces, system_distributed, and system_auth keyspaces and ran
>> `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly
>> with no issues.
>> >
>> > 2. For a large keyspace, I added the second datacenter ('dc2') with an
>> `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy',
>> 'dc1': '2', 'dc2': '3'};` statement. Immediately, I saw this error in the
>> log:
>> > ```
>> > "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>> > "org.apache.cassandra.exceptions.ConfigurationException: Column
>> family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected
>> 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>> > "\tat
>> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> > "\tat
>> org.apache.cassandr

Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread Jeff Jirsa
It sends an exception to the client; it doesn't sever the connection.


On Tue, Oct 12, 2021 at 10:06 AM S G  wrote:

> Do the timeout values only kill the connection with the client or send
> error to the client?
> Or do they also kill the corresponding query execution happening on the
> Cassandra servers (co-ordinator, replicas etc) ?
>
> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:
>
>> The read and write timeout values do this today.
>>
>>
>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>
>>
>> On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:
>>
>>> Hello,
>>>
>>> Is there a way to stop long running queries in Cassandra (versions
>>> 3.11.x or 4.x) ?
>>> The use-case is to have some kind of a circuit breaker based on
>>> query-time that has exceeded the client's SLAs.
>>> Example: If server response is useless to the client after 10 ms, then
>>> we could
>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms allows
>>> for some buffer).
>>> And when that much time has elapsed, Cassandra will kill the query
>>> execution automatically.
>>>
>>> If this is not possible in Cassandra currently, any chance we can do it
>>> outside of Cassandra, like
>>> a shell script that monitors such long running queries (through users
>>> table etc) and kills the
>>> OS-thread responsible for that query (Looks unsafe though as that might
>>> leave the DB in an inconsistent state) ?
>>>
>>> We are trying this as a proactive measure to safeguard our clusters from
>>> any rogue queries fired accidentally or maliciously.
>>>
>>> Thanks !
>>>
>>>


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread Jeff Jirsa
The read and write timeout values do this today.

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
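
The relevant knobs (values shown are the usual defaults; names are the
pre-4.1 *_in_ms forms):

```
# cassandra.yaml
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
```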


On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:

> Hello,
>
> Is there a way to stop long running queries in Cassandra (versions 3.11.x
> or 4.x) ?
> The use-case is to have some kind of a circuit breaker based on query-time
> that has exceeded the client's SLAs.
> Example: If server response is useless to the client after 10 ms, then we
> could
> have a *query_killing_timeout* set to 15 ms (where additional 5ms allows
> for some buffer).
> And when that much time has elapsed, Cassandra will kill the query
> execution automatically.
>
> If this is not possible in Cassandra currently, any chance we can do it
> outside of Cassandra, like
> a shell script that monitors such long running queries (through users
> table etc) and kills the
> OS-thread responsible for that query (Looks unsafe though as that might
> leave the DB in an inconsistent state) ?
>
> We are trying this as a proactive measure to safeguard our clusters from
> any rogue queries fired accidentally or maliciously.
>
> Thanks !
>
>


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Jeff Jirsa
The most likely explanation is that repair failed and you didn't notice.
Or that you didn't actually repair every host / every range.

Which version are you using?
How did you run repair?
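
For reference, a full (non-incremental) repair after an RF change looks
something like this, run on every node one at a time (keyspace name is a
placeholder):

```
nodetool repair -full my_keyspace
```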


On Tue, Oct 12, 2021 at 4:33 AM Isaeed Mohanna  wrote:

> Hi
>
> Yes I am sacrificing consistency to gain higher availability and faster
> speed, but my problem is not with newly inserted data that is not there for
> a very short period of time, my problem is the data that was there before
> the RF change, still do not exist in all replicas even after repair.
>
> It looks like my cluster configuration is RF3 but the data itself is still
> using RF2 and when the data is requested from the 3rd (new) replica, it
> is not there and an empty record is returned with read CL1.
>
> What can I do to force this data to be synced to all replicas as it
> should? So read CL1 request will actually return a correct result?
>
>
>
> Thanks
>
>
>
> *From:* Bowen Song 
> *Sent:* Monday, October 11, 2021 5:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Trouble After Changing Replication Factor
>
>
>
> You have RF=3 and both read & write CL=1, which means you are asking
> Cassandra to give up strong consistency in order to gain higher
> availability and perhaps slight faster speed, and that's what you get. If
> you want to have strong consistency, you will need to make sure (read CL +
> write CL) > RF.
>
> On 10/10/2021 11:55, Isaeed Mohanna wrote:
>
> Hi
>
> We had a cluster with 3 Nodes with Replication Factor 2 and we were using
> read with consistency Level One.
>
> We recently added a 4th node and changed the replication factor to 3,
> once this was done apps reading from DB with CL1 would receive an empty
> record, Looking around I was surprised to learn that upon changing the
> replication factor if the read request is sent to a node that should own the
> record according to the new replication factor while it still doesn’t have
> it yet then an empty record will be returned because of CL1, the record
> will be written to that node after the repair operation is over.
>
> We ran the repair operation which took days in our case (we had to change
> apps to CL2 to avoid serious data inconsistencies).
>
> Now the repair operations are over and if I revert to CL1 we are still
> getting errors that records do not exist in the DB while they do; using CL2
> again it works fine.
>
> Any ideas what I am missing?
>
> Is there a way to validate that the repair task has actually done what is
> needed and that the data is actually now replicated RF3 ?
>
> Could it it be a Cassandra Driver issue? Since if I issue the request in
> cqlsh I do get the record but I cannot know if I am hitting the replica
> that doesn’t hold the record
>
> Thanks for your help
>
>

