Thank you very much for helping me out on this! The table
fieldcounts is currently pretty small - 6.4 million rows.
cfstats are:
Total number of tables: 81
----------------
Keyspace : doc
       Read Count: 3713134
       Read Latency: 0.2664131157130338 ms
       Write Count: 47513045
       Write Latency: 1.0725477948634947 ms
       Pending Flushes: 0
               Table: fieldcounts
               SSTable count: 3
               Space used (live): 16010248
               Space used (total): 16010248
               Space used by snapshots (total): 0
               Off heap memory used (total): 4947
               SSTable Compression Ratio: 0.3994304032360534
               Number of partitions (estimate): 3
               Memtable cell count: 0
               Memtable data size: 0
               Memtable off heap memory used: 0
               Memtable switch count: 0
               Local read count: 379
               Local read latency: NaN ms
               Local write count: 0
               Local write latency: NaN ms
               Pending flushes: 0
               Percent repaired: 100.0
               Bloom filter false positives: 0
               Bloom filter false ratio: 0.00000
               Bloom filter space used: 48
               Bloom filter off heap memory used: 24
               Index summary off heap memory used: 51
               Compression metadata off heap memory used: 4872
               Compacted partition minimum bytes: 8409008
               Compacted partition maximum bytes: 25109160
               Compacted partition mean bytes: 15096925
               Average live cells per slice (last five minutes): NaN
               Maximum live cells per slice (last five minutes): 0
               Average tombstones per slice (last five minutes): NaN
               Maximum tombstones per slice (last five minutes): 0
               Dropped Mutations: 0
Commitlog is on a separate spindle on the 7 node cluster. All disks
are SATA (spinning rust as they say!). This is an R&D platform, but
I will switch to NetworkTopologyStrategy. I'm using Prometheus and
Grafana to monitor Cassandra and the CPU load is typically 100 to
200% on most of the nodes. Disk IO is typically pretty low.
Performance - in general Async is about 10x faster.
ExecuteAsync:
35mSec for 364 rows.
8120mSec for 205001 rows.
14788mSec for 345001 rows.
4117mSec for 86400 rows.
23,330 rows per second on average
Execute:
232mSec for 364 rows.
584869mSec for 1263283 rows
46290mSec for 86400 rows
2,160 rows per second on average
Curious - our largest table (doc) has the following stats - is it not
partitioned well?
Total number of tables: 81
----------------
Keyspace : doc
       Read Count: 3713134
       Read Latency: 0.2664131157130338 ms
       Write Count: 47513045
       Write Latency: 1.0725477948634947 ms
       Pending Flushes: 0
               Table: doc
               SSTable count: 26
               Space used (live): 57124641753
               Space used (total): 57124641753
               Space used by snapshots (total): 113012646218
               Off heap memory used (total): 27331913
               SSTable Compression Ratio: 0.2531585373184219
               Number of partitions (estimate): 12
               Memtable cell count: 0
               Memtable data size: 0
               Memtable off heap memory used: 0
               Memtable switch count: 0
               Local read count: 27169
               Local read latency: NaN ms
               Local write count: 0
               Local write latency: NaN ms
               Pending flushes: 0
               Percent repaired: 0.0
               Bloom filter false positives: 0
               Bloom filter false ratio: 0.00000
               Bloom filter space used: 576
               Bloom filter off heap memory used: 368
               Index summary off heap memory used: 425
               Compression metadata off heap memory used: 27331120
               Compacted partition minimum bytes: 24602
               Compacted partition maximum bytes: 63771372175
               Compacted partition mean bytes: 7052951452
               Average live cells per slice (last five minutes): NaN
               Maximum live cells per slice (last five minutes): 0
               Average tombstones per slice (last five minutes): NaN
               Maximum tombstones per slice (last five minutes): 0
               Dropped Mutations: 0
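For readability, the byte counts in the cfstats output can be converted to MiB. A minimal sketch, using only numbers already shown above; `PartitionSizes`, `toMiB` and `describe` are hypothetical names chosen for illustration:

```java
import java.util.Locale;

/** Converts the raw byte counts printed by cfstats into MiB for easier reading. */
final class PartitionSizes {
    private static final double BYTES_PER_MIB = 1024.0 * 1024.0;

    private PartitionSizes() {}

    /** Returns the given byte count expressed in MiB. */
    static double toMiB(long bytes) {
        return bytes / BYTES_PER_MIB;
    }

    /** Formats a labelled size with one decimal place, e.g. "x: 1.0 MiB". */
    static String describe(String label, long bytes) {
        return String.format(Locale.ROOT, "%s: %.1f MiB", label, toMiB(bytes));
    }
}
```

Calling `PartitionSizes.describe("doc max", 63_771_372_175L)` reports roughly 60,800 MiB (about 59 GiB) for the largest doc partition, while `PartitionSizes.describe("fieldcounts max", 25_109_160L)` gives about 24 MiB. Both can be compared against the roughly 100 MB per partition that is often cited as a rule of thumb (an outside assumption, not a figure from this thread).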
Thanks again!
-Joe
On 3/12/2021 11:01 AM, Bowen Song wrote:
The fact that sleep-then-retry works is just another indicator that
it's likely a GC-pause-related issue. I'd recommend checking your
Cassandra servers' GC logs first.
Do you know the maximum partition size for the doc.fieldcounts
table? (Try the "nodetool cfstats doc.fieldcounts" command.) I
suspect this table has large partitions, which usually lead to GC
issues.
As for your failed executeAsync() insert issue, do you know how many
concurrent in-flight queries you have? The Cassandra driver limits
the number of in-flight requests, and new executeAsync() calls will
fail when the limit is reached.
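One common way to stay under a driver-side limit on in-flight requests is to cap them on the client with a semaphore. The sketch below is only an illustration under assumptions: `ThrottledAsync`, `maxInFlight` and `asyncCall` are hypothetical names, and `asyncCall` stands in for any call that returns a `CompletionStage` (such as `executeAsync` in the 4.x Java driver).

```java
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Caps the number of in-flight asynchronous requests with a semaphore. */
final class ThrottledAsync<T, R> {
    private final Semaphore permits;
    private final Function<T, CompletionStage<R>> asyncCall;

    ThrottledAsync(int maxInFlight, Function<T, CompletionStage<R>> asyncCall) {
        this.permits = new Semaphore(maxInFlight);
        this.asyncCall = asyncCall;
    }

    /** Blocks the caller until a slot is free, then starts the request. */
    CompletionStage<R> submit(T request) throws InterruptedException {
        permits.acquire();
        try {
            // Free the slot when the request finishes, whether it succeeded or failed.
            return asyncCall.apply(request)
                    .whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the call threw before returning a stage
            throw e;
        }
    }

    /** Number of requests that could be started right now without blocking. */
    int availableSlots() {
        return permits.availablePermits();
    }
}
```

With the 4.x driver this might be wired up as `new ThrottledAsync<BoundStatement, AsyncResultSet>(128, stmt -> session.executeAsync(stmt))`; the value 128 is arbitrary and would need tuning against the cluster.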
I'm also a bit concerned about your "significantly" slower inserts.
Inserts (excluding "INSERT IF NOT EXISTS") should be very fast in
Cassandra. How slow are they? Are they always slow like that, or
usually fast but some are much slower than others? What does the CPU
usage & disk IO look like on the Cassandra server? Do you have
commitlog on the same disk as the data? Is it a spinning disk, SATA
SSD or NVMe?
BTW, you really shouldn't use SimpleStrategy for production
environments.
On 12/03/2021 15:18, Joe Obernberger wrote:
The queries that are failing are:
select fieldvalue, count from doc.ordered_fieldcounts where
source=? and fieldname=? limit 10
Created with:
CREATE TABLE doc.ordered_fieldcounts (
   source text,
   fieldname text,
   count bigint,
   fieldvalue text,
   PRIMARY KEY ((source, fieldname), count, fieldvalue)
) WITH CLUSTERING ORDER BY (count DESC, fieldvalue ASC)
and:
select fieldvalue, count from doc.fieldcounts where source=? and
fieldname=?
Created with:
CREATE TABLE doc.fieldcounts (
   source text,
   fieldname text,
   fieldvalue text,
   count bigint,
   PRIMARY KEY (source, fieldname, fieldvalue)
)
This really seems like a driver issue. I put retry logic around
the calls and now those queries work. Basically if it throws an
exception, I Thread.sleep(500) and then retry. This seems to be a
continuing theme with Cassandra in general. Is this common practice?
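The sleep-then-retry approach described above can be written as a small helper. This is a minimal sketch, not the code from FTAProcess: `Retry`, `withBackoff`, `maxAttempts` and `initialDelayMs` are hypothetical names, and doubling the delay after each failure is an assumption added here in place of the fixed 500 ms sleep.

```java
import java.util.concurrent.Callable;

/** Retries a failing action a bounded number of times, sleeping between attempts. */
final class Retry {
    private Retry() {}

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * The delay starts at initialDelayMs and doubles after each failure.
     */
    static <T> T withBackoff(Callable<T> action, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up when out of attempts, and never swallow an interrupt.
                if (attempt >= maxAttempts || e instanceof InterruptedException) {
                    throw e;
                }
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }
}
```

A call such as `Retry.withBackoff(() -> session.execute(statement), 5, 500)` would re-run the query up to five times; because the action is invoked again on every attempt, each retry issues a fresh request.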
After doing this retry logic, an insert statement started failing
with an illegal state exception when I retried it (which makes
sense). This insert was using
session.executeAsync(boundStatement). I changed that to just
execute (instead of async) and now I get no errors, no retries
anywhere. The insert is *significantly* slower when running
execute vs executeAsync. When using executeAsync:
com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
       at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
       at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
       at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.maybeMoveToNextPage(MultiPageResultSet.java:99)
       at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.computeNext(MultiPageResultSet.java:91)
       at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.computeNext(MultiPageResultSet.java:79)
       at com.datastax.oss.driver.internal.core.util.CountingIterator.tryToComputeNext(CountingIterator.java:91)
       at com.datastax.oss.driver.internal.core.util.CountingIterator.hasNext(CountingIterator.java:86)
       at com.ngc.helios.fieldanalyzer.FTAProcess.handleOrderedFieldCounts(FTAProcess.java:684)
       at com.ngc.helios.fieldanalyzer.FTAProcess.storeResults(FTAProcess.java:214)
       at com.ngc.helios.fieldanalyzer.FTAProcess.startProcess(FTAProcess.java:190)
       at com.ngc.helios.fieldanalyzer.Main.main(Main.java:20)
The interesting part here is that the line that is now failing (line
684 in FTAProcess) is:
if (itRs.hasNext())
where itRs is an Iterator<Row> over a select query from another
table. I'm iterating over a result set from a select and
inserting those results via executeAsync.
-Joe
On 3/12/2021 9:07 AM, Bowen Song wrote:
Millions of rows in a single query? That sounds like a bad idea to
me. Your "NoNodeAvailableException" could be caused by
stop-the-world GC pauses, and the GC pauses are likely caused by
the query itself.
On 12/03/2021 13:39, Joe Obernberger wrote:
Thank you Paul and Erick. The keyspace is defined like this:
CREATE KEYSPACE doc WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '3'} AND durable_writes
= true;
Would that cause this?
The program that is having the problem selects data, calculates
stuff, and inserts. It works with smaller selects, but when the
number of rows is in the millions, I start to get this error.
Since it works with smaller sets, I don't believe it to be a
network error. All the nodes are definitely up as other
processes are working OK, it's just this one program that fails.
The full stack trace:
Error: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
       at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
       at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
       at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
       at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
       at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
       at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
       at com.abc.xxxx.fieldanalyzer.FTAProcess.udpateCassandraFTAMetrics(FTAProcess.java:275)
       at com.abc.xxxx.fieldanalyzer.FTAProcess.storeResults(FTAProcess.java:216)
       at com.abc.xxxx.fieldanalyzer.FTAProcess.startProcess(FTAProcess.java:199)
       at com.abc.xxxx.fieldanalyzer.Main.main(Main.java:20)
FTAProcess line 275 is:
ResultSet rs = session.execute(getFieldCounts.bind().setString(0,
rb.getSource()).setString(1, rb.getFieldName()));
-Joe
On 3/12/2021 8:30 AM, Paul Chandler wrote:
Hi Joe
This could also be caused by the replication factor of the
keyspace. If you have NetworkTopologyStrategy and it doesn't
list a replication factor for the datacenter datacenter1, then
you will get this error message too.
Paul
On 12 Mar 2021, at 13:07, Erick Ramirez
<erick.rami...@datastax.com
<mailto:erick.rami...@datastax.com>> wrote:
Does it get returned by the driver every single time? The
NoNodeAvailableException gets thrown when (1) all nodes are
down, or (2) all the contact points are invalid from the
driver's perspective.
Is it possible there's no route/connectivity from your app
server(s) to the 172.16.x.x network? If you post the full error
message + full stacktrace, it might provide clues. Cheers!