Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The CFP is still open for a few more weeks. Options could be my Performance Engineering track or the Cassandra track – or both 😊
https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards,
Paul Brebner

From: Joe Obernberger <joseph.obernber...@gmail.com>
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Cassandra 5.0 Beta1 - vector searching results

Hi All - I'd like to share some initial results for the vector search on Cassandra 5.0 beta1. 3-node cluster running in Kubernetes; fast NetApp storage.

Have a table (doc.embeddings_googleflant5large) with definition:

    CREATE TABLE doc.embeddings_googleflant5large (
        uuid text,
        type text,
        fieldname text,
        offset int,
        sourceurl text,
        textdata text,
        creationdate timestamp,
        embeddings vector<float, 768>,
        metadata boolean,
        source text,
        PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
    ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
        AND additional_write_policy = '99p'
        AND allow_auto_snapshot = true
        AND bloom_filter_fp_chance = 0.01
        AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
        AND cdc = false
        AND comment = ''
        AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
        AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
        AND memtable = 'default'
        AND crc_check_chance = 1.0
        AND default_time_to_live = 0
        AND extensions = {}
        AND gc_grace_seconds = 864000
        AND incremental_backups = true
        AND max_index_interval = 2048
        AND memtable_flush_period_in_ms = 0
        AND min_index_interval = 128
        AND read_repair = 'BLOCKING'
        AND speculative_retry = '99p';

    CREATE CUSTOM INDEX ann_index_googleflant5large ON doc.embeddings_googleflant5large (embeddings) USING 'sai';
    CREATE CUSTOM INDEX offset_index_googleflant5large ON doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

    UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
    UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
    UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

    Total number of tables: 1
    ----------------
    Keyspace: doc
        Read Count: 0
        Read Latency: NaN ms
        Write Count: 2893108
        Write Latency: 326.3586520174843 ms
        Pending Flushes: 0
            Table: embeddings_googleflant5large
            SSTable count: 6
            Old SSTable count: 0
            Max SSTable size: 5.108GiB
            Space used (live): 19318114423
            Space used (total): 19318114423
            Space used by snapshots (total): 0
            Off heap memory used (total): 4874912
            SSTable Compression Ratio: 0.97448
            Number of partitions (estimate): 58399
            Memtable cell count: 0
            Memtable data size: 0
            Memtable off heap memory used: 0
            Memtable switch count: 16
            Speculative retries: 0
            Local read count: 0
            Local read latency: NaN ms
            Local write count: 2893108
            Local write latency: NaN ms
            Local read/write ratio: 0.00000
            Pending flushes: 0
            Percent repaired: 100.0
            Bytes repaired: 9.066GiB
            Bytes unrepaired: 0B
            Bytes pending repair: 0B
            Bloom filter false positives: 7245
            Bloom filter false ratio: 0.00286
            Bloom filter space used: 87264
            Bloom filter off heap memory used: 87216
            Index summary off heap memory used: 34624
            Compression metadata off heap memory used: 4753072
            Compacted partition minimum bytes: 2760
            Compacted partition maximum bytes: 4866323
            Compacted partition mean bytes: 154523
            Average live cells per slice (last five minutes): NaN
            Maximum live cells per slice (last five minutes): 0
            Average tombstones per slice (last five minutes): NaN
            Maximum tombstones per slice (last five minutes): 0
            Droppable tombstone ratio: 0.00000

nodetool tablehistograms doc.embeddings_googleflant5large

    doc/embeddings_googleflant5large histograms
    Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
                (micros)      (micros)                 (bytes)
    50%         0.00          0.00           0.00      105778          124
    75%         0.00          0.00           0.00      182785          215
    95%         0.00          0.00           0.00      379022          446
    98%         0.00          0.00           0.00      545791          642
    99%         0.00          0.00           0.00      654949          924
    Min         0.00          0.00           0.00      2760            4
    Max         0.00          0.00           0.00      4866323         5722

Running a query such as:

    select uuid,offset,type,textdata from doc.embeddings_googleflant5large order by embeddings ANN OF [768 dimension vector] limit 20;

works fine, typically returning in less than 5 seconds. Subsequent queries are even faster. If I'm actively adding data to the table, the searches can sometimes time out (using cqlsh).

If I add something to the where clause, the performance drops significantly:

    select uuid,offset,type,textdata from doc.embeddings_googleflant5large where offset=1 order by embeddings ANN OF [] limit 20;

That query will time out when running in cqlsh, even with no data being added to the table.

We've been running a Weaviate database side-by-side with Cassandra 4, and would love to drop Weaviate if we can do all the vector searches inside of Cassandra.

What else can I try? Anything to increase performance?

Thanks all!

-Joe
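
For anyone wanting to reproduce the two queries from the message, here is a minimal sketch (not part of the original post) of how the CQL text could be assembled in Python with the elided vector filled in. The helper names `vector_literal` and `build_ann_query` are hypothetical; the table name, selected columns and the 768 dimensions are taken from the schema in the message. The code only builds the query string and does not connect to a cluster.

```python
import math
from typing import Optional, Sequence

# Taken from the schema in the message: vector<float, 768>.
TABLE = "doc.embeddings_googleflant5large"
DIMENSIONS = 768


def vector_literal(vector: Sequence[float]) -> str:
    """Render a sequence of numbers as a CQL vector literal such as [0.1, 0.2]."""
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    values = [float(x) for x in vector]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector components must be finite numbers")
    return "[" + ", ".join(repr(v) for v in values) + "]"


def build_ann_query(vector: Sequence[float],
                    limit: int = 20,
                    offset: Optional[int] = None) -> str:
    """Build the ANN query text; `offset` adds the WHERE filter from the slow case."""
    where = f" WHERE offset = {int(offset)}" if offset is not None else ""
    return (
        f"SELECT uuid, offset, type, textdata FROM {TABLE}"
        f"{where} ORDER BY embeddings ANN OF {vector_literal(vector)}"
        f" LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    # Placeholder vector; a real run would use an actual 768-dimension embedding.
    placeholder = [0.0] * DIMENSIONS
    print(build_ann_query(placeholder)[:120], "...")
    print(build_ann_query(placeholder, offset=1)[:120], "...")
```

The resulting string could be pasted into cqlsh to time the unfiltered and filtered variants side by side. With a client driver, passing the vector as a bound parameter would probably be preferable to string formatting.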