[ https://issues.apache.org/jira/browse/CASSANDRA-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ke Han updated CASSANDRA-18728:
-------------------------------
    Description:
h1. Bug Description

When using Cassandra 3.11.15 to load legacy data from 2.2.10, I noticed that the byte representation of the column identifier is incorrect.

The legacy data contains two tables, and the schema is as follows.
{code:java}
cqlsh> desc test.alpha ;

CREATE TABLE test.alpha (
    key text PRIMARY KEY,
    foo text
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

cqlsh> DESC test.foos ;

CREATE TABLE test.foos (
    key text PRIMARY KEY,
    "666f6f" text
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';
CREATE INDEX idx_foo ON test.foos ("666f6f");
{code}
The table test.foos has a column with {*}name = "666f6f"{*}, so the corresponding byte representation should be Hex("666f6f") == {*}363636663666{*}.
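To make the expected value concrete, here is a minimal standalone sketch of the arithmetic. It does not use any Cassandra classes; {{ColumnNameBytes}}, {{utf8Hex}} and {{hexToText}} are hypothetical helper names introduced only for illustration. It assumes column names are encoded as UTF-8, and shows that the bytes of the name "666f6f" differ from the bytes obtained by treating "666f6f" as a hex string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Cassandra.
final class ColumnNameBytes {

    /** Returns the hex form of the UTF-8 bytes of a column name. */
    public static String utf8Hex(String name) {
        ByteBuffer buf = ByteBuffer.wrap(name.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        // Read only the remaining bytes, so position and limit are respected.
        while (buf.hasRemaining())
            sb.append(String.format("%02x", buf.get() & 0xff));
        return sb.toString();
    }

    /** Treats the input as a hex string, converts it to raw bytes and decodes them as UTF-8. */
    public static String hexToText(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++)
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("666f6f"));   // expected bytes for the column named "666f6f"
        System.out.println(utf8Hex("foo"));      // bytes for a column named "foo"
        System.out.println(hexToText("666f6f")); // what the name becomes if it is read as hex
    }
}
```

Under these assumptions, a buffer that decodes to "foo" holds the bytes 66 6f 6f rather than 36 36 36 66 36 66, which is the mismatch this report describes.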
However, when 3.11.15 loads the data and creates the column, the ByteBuffer still holds the raw bytes 0x666f6f, which decode to "foo" (see the log below), instead of the bytes of the string "666f6f".
{code:java}
// src/java/org/apache/cassandra/schema/SchemaKeyspace.java
public static ColumnDefinition createColumnFromRow(UntypedResultSet.Row row, Types types)
{
    String keyspace = row.getString("keyspace_name");
    String table = row.getString("table_name");

    ColumnDefinition.Kind kind = ColumnDefinition.Kind.valueOf(row.getString("kind").toUpperCase());

    int position = row.getInt("position");
    ClusteringOrder order = ClusteringOrder.valueOf(row.getString("clustering_order").toUpperCase());

    AbstractType<?> type = parse(keyspace, row.getString("type"), types);
    if (order == ClusteringOrder.DESC)
        type = ReversedType.getInstance(type);

    // Log statement injected for debugging: prints the column name and its stored bytes decoded as UTF-8
    logger.info(String.format("column_name = %s, column_name_bytes = %s",
                              row.getString("column_name"),
                              new String(row.getBytes("column_name_bytes").array(), StandardCharsets.UTF_8)));

    ColumnIdentifier name = new ColumnIdentifier(row.getBytes("column_name_bytes"), row.getString("column_name"));

    return new ColumnDefinition(keyspace, table, name, type, position, kind);
}
{code}
h2. Logs

INFO [main] 2023-08-07 02:21:53,762 SchemaKeyspace.java:1136 - *{color:#de350b}column_name = 666f6f, column_name_bytes = foo{color}*

It should be: +column_name_bytes = {color:#172b4d}666f6f{color}+
{code:java}
INFO [main] 2023-08-07 02:21:53,722 StorageService.java:773 - Populating token metadata from system tables
INFO [main] 2023-08-07 02:21:53,736 StorageService.java:780 - Token metadata: Normal Tokens: localhost/127.0.0.1:[95610762103941981519101009083045058398]
INFO [main] 2023-08-07 02:21:53,756 SchemaKeyspace.java:1136 - column_name = column1, column_name_bytes = column1
INFO [main] 2023-08-07 02:21:53,756 SchemaKeyspace.java:1136 - column_name = foo, column_name_bytes = foo
INFO [main] 2023-08-07 02:21:53,756 SchemaKeyspace.java:1136 - column_name = key, column_name_bytes = key
INFO [main] 2023-08-07 02:21:53,756 SchemaKeyspace.java:1136 - column_name = value, column_name_bytes = value
INFO [main] 2023-08-07 02:21:53,762 SchemaKeyspace.java:1136 - column_name = 666f6f, column_name_bytes = foo // Incorrect!
INFO [main] 2023-08-07 02:21:53,762 SchemaKeyspace.java:1136 - column_name = column1, column_name_bytes = column1
{code}
h1. Reproduce Method

I have attached the data tar file. If you start Cassandra 3.11.15 with it and inject the log statement shown above to print the buffer value, the log shows the incorrect value.
h1. Thoughts

This is a transient bug which won't lead to exceptions or error logs. But the incorrect byte representation might lead to some critical issues.

This bug shares the same triggering method with CASSANDRA-14468. I believe this bug also shares the same root cause as CASSANDRA-14468. In CASSANDRA-14468, the incorrect byte representation could lead to an upgrade exception. It was partially fixed by avoiding the interning of ColumnIdentifier (which makes this bug transient).
But the real root cause remains, and it can still cause other problems.

> [Transient Bug] Incorrect ByteBuffer representation of ColumnIdentifiers when 3.11.15 loading legacy data from 2.x
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18728
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18728
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ke Han
>            Priority: Normal
>         Attachments: data.tar.gz, system.log

--
This message was sent by Atlassian Jira
(v8.20.10#820010)