Robert Knutsson created CASSANDRA-20141:
-------------------------------------------

             Summary: Unresponsive node after ingesting large amounts of vectors
                 Key: CASSANDRA-20141
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20141
             Project: Apache Cassandra
          Issue Type: Bug
            Reporter: Robert Knutsson


*Background:*

We have a Cassandra 5.0.2 cluster running on Java 17. We have tried cluster 
sizes from 3 to 23 nodes (running in AWS on r7i.4xlarge instances).

We have a table with an id column of type TEXT and another column of type 
VECTOR<FLOAT, 256>.

On that table we also have an SAI index on the VECTOR column with the options 
{ 'similarity_function': 'EUCLIDEAN' }.
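
For anyone trying to reproduce this, here is a minimal sketch of a schema matching the description above. The keyspace, table, column and index names (demo, embeddings, embedding, embeddings_embedding_idx) are placeholders and not our real names, and the statements are only built as strings here, not executed against a cluster.

```java
// Sketch only: builds CQL text for a table and SAI vector index like the ones
// described in this report. All identifiers are hypothetical placeholders.
final class VectorSchemaSketch {
    // Vector dimensionality taken from the report (VECTOR<FLOAT, 256>).
    static final int DIMENSIONS = 256;

    // CQL for a table with a text id and a fixed-size float vector column.
    static String createTable(String keyspace, String table) {
        return "CREATE TABLE IF NOT EXISTS " + keyspace + "." + table + " (\n"
             + "    id text PRIMARY KEY,\n"
             + "    embedding vector<float, " + DIMENSIONS + ">\n"
             + ");";
    }

    // CQL for an SAI index on the vector column using the EUCLIDEAN option.
    static String createIndex(String keyspace, String table) {
        return "CREATE INDEX IF NOT EXISTS " + table + "_embedding_idx\n"
             + "    ON " + keyspace + "." + table + " (embedding) USING 'sai'\n"
             + "    WITH OPTIONS = {'similarity_function': 'EUCLIDEAN'};";
    }

    public static void main(String[] args) {
        System.out.println(createTable("demo", "embeddings"));
        System.out.println(createIndex("demo", "embeddings"));
    }
}
```

The printed statements can be pasted into cqlsh after adjusting the names.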

*When:*

When we ingest a large number of embeddings (~200 million), a node becomes 
unresponsive every time before all embeddings are saved (after more than 
20 million have been ingested), and no other node is able to rejoin the 
cluster.

If the index is removed before we ingest the data, everything is persisted 
properly. But once the index is added (and created successfully), the same 
thing happens again as soon as we continue writing more embeddings to the 
cluster.

*What:*

We saw the following stacktrace in our logs:
{noformat}
java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "res" is null
    at org.apache.cassandra.utils.memory.MemtableCleanerThread$Clean.apply(MemtableCleanerThread.java:97)
    at org.apache.cassandra.utils.concurrent.ListenerList$CallbackBiConsumerListener.run(ListenerList.java:244)
    at org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
    at org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:166)
    at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:157)
    at org.apache.cassandra.utils.concurrent.ListenerList$CallbackBiConsumerListener.notifySelf(ListenerList.java:250)
    at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
    at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
    at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
    at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
    at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
    at org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
    at org.apache.cassandra.db.memtable.AbstractAllocatorMemtable.lambda$flushLargestMemtable$0(AbstractAllocatorMemtable.java:306)
    at org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
    at org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:166)
    at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:157)
    at org.apache.cassandra.utils.concurrent.ListenerList$RunnableWithExecutor.notifySelf(ListenerList.java:345)
    at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
    at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
    at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
    at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
    at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
    at org.apache.cassandra.concurrent.FutureTask.tryFailure(FutureTask.java:87)
    at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:75)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
{noformat}
This leads me to believe that the NPE above occurs when the Memtables are 
about to be cleaned (flushed to disk as SSTables, perhaps?).
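
One possible reading of the trace, offered as a guess rather than a diagnosis: the tryFailure frames suggest a flush future completed with a failure, so the completion callback was handed a null result, and unboxing that null Boolean threw the NPE. The sketch below reproduces only that general pattern. It uses java.util.concurrent.CompletableFuture as a stand-in for Cassandra's own future classes, and every name in it is hypothetical; it is not the Cassandra source.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Sketch only: shows how a (result, error) callback that unboxes the result
// without checking the error first throws an NPE when the future has failed.
final class NullUnboxingSketch {
    // Hypothetical unsafe callback: "if (res)" auto-unboxes a Boolean,
    // which throws NullPointerException when res is null.
    static String unsafeCallback(Boolean res, Throwable err) {
        if (res) {
            return "cleaned";
        }
        return "nothing to clean";
    }

    // Hypothetical defensive variant: inspect the error before the result.
    static String safeCallback(Boolean res, Throwable err) {
        if (err != null) {
            return "failed: " + err.getMessage();
        }
        return Boolean.TRUE.equals(res) ? "cleaned" : "nothing to clean";
    }

    // Completes a future exceptionally, then runs one of the callbacks on it.
    // On failure, handle() passes (null, exception) to the callback.
    static String outcomeOfFailedFuture(boolean safe) {
        CompletableFuture<Boolean> flush = new CompletableFuture<>();
        flush.completeExceptionally(new IllegalStateException("flush failed"));
        try {
            return flush
                .handle((res, err) -> safe ? safeCallback(res, err) : unsafeCallback(res, err))
                .join();
        } catch (CompletionException e) {
            // join() wraps whatever the callback threw.
            return "callback threw " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcomeOfFailedFuture(false));
        System.out.println(outcomeOfFailedFuture(true));
    }
}
```

If this reading is right, the NPE would be masking an earlier flush failure, and that underlying exception would be the more interesting one to find in the logs.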



