Hi Bogdan,

First - I am probably being Captain Obvious here - but you will lose some data in the process: since interaction_timestamp is no longer part of the primary key, the same row will be upserted multiple times, and the last write wins.
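To illustrate the mechanics, here is a toy sketch in plain Python (not Cassandra; just a dict standing in for the new table) showing why the surviving timestamp depends entirely on load order once the timestamp leaves the key:

```python
# Rows taken from your CSV sample. With interaction_timestamp out of the
# primary key, every row sharing (user_id, message_id, interaction_state)
# upserts the same target row, so whichever row is loaded last wins.
rows = [
    ("123", "ed6c69a0-0add-11b2-8080-808080808080", "DISMISSED", "2020-01-03 17:50:59+0000"),
    ("123", "ed6c69a0-0add-11b2-8080-808080808080", "DISMISSED", "2020-01-10 00:05:41+0000"),
]

def load(batch):
    """Simulate upserts into the new table: last write per key wins."""
    table = {}
    for user_id, message_id, state, ts in batch:
        table[(user_id, message_id, state)] = ts
    return table

in_order = load(rows)            # survivor is the 2020-01-10 timestamp
reordered = load(reversed(rows)) # survivor is the 2020-01-03 timestamp
```

Both load orders are equally valid outcomes for a parallel loader, which is exactly why you sometimes see the first row survive and sometimes the last.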
The CSV rows are loaded in parallel, with multiple threads and multiple chunks, so the order is indeed NOT guaranteed. I had to go back to older documentation to find the proper description: "COPY FROM loads rows from a CSV file in a parallel non-deterministic order."
https://docs.datastax.com/en/cql-oss/3.1/cql/cql_reference/copy_r.html

You can still play with CHUNKSIZE, but that is not what you want:
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshCopy.html

The solution here would be to work on the dump, I am afraid (sort the items from the first table and keep only the latest interaction_timestamp per key).

Cedrick

On Tue, Feb 11, 2020 at 1:17 PM Bogdan Gherca <bogdan.ghe...@gmail.com> wrote:
> Hey Cassandra folks,
>
> I'm trying to change the schema of an existing table by creating a new one
> and migrating the data.
>
> The initial table schema looks like this:
>
> CREATE TABLE IF NOT EXISTS initial_table (
>     user_id text,
>     message_id timeuuid,
>     interaction_state text,
>     interaction_timestamp timestamp,
>     PRIMARY KEY ((user_id), message_id, interaction_state, interaction_timestamp)
> );
>
> We're trying to remove interaction_timestamp from the PK - same schema but
> with PRIMARY KEY ((user_id), message_id, interaction_state).
>
> When importing the .csv dump obtained from the initial_table, the
> timestamp column seems to be written in a weird way. Multiple rows from the
> old schema need to be merged into a single entry of the new schema. In most
> cases it seems the last entry gets copied over to the new table, while in
> others a random one gets copied. Check out the csv sample below and the
> copy from result.
> initial_table_dump.csv
>
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-10 00:05:41+0000
>
> copy new_table(user_id, message_id, interaction_state, interaction_timestamp) from '~/initial_table_dump.csv';
>
> Result:
>
>  user_id | message_id                           | interaction_state | interaction_timestamp
> ---------+--------------------------------------+-------------------+--------------------------
>      123 | ed6c69a0-0add-11b2-8080-808080808080 | DISMISSED         | 2020-01-03 17:50:59+0000
>
> Notice the first row from the csv gets written into the new table in this
> case - here there are only two rows, but with more rows it seems a random
> one gets copied over, not necessarily the first or the last one. When
> updating the interaction_timestamp column value as below, it seems to copy
> the latest entry to the new table.
>
> initial_table_dump_2.csv
>
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-05 00:05:41+0000
>
> - perform the same copy from operation -
>
> Result:
>
>  user_id | message_id                           | interaction_state | interaction_timestamp
> ---------+--------------------------------------+-------------------+--------------------------
>      123 | ed6c69a0-0add-11b2-8080-808080808080 | DISMISSED         | 2020-01-05 00:05:41+0000
>
> Could someone help me understand why this might happen? Does 'copy from'
> follow the order from the csv when doing the import, or are there no
> order guarantees?
> I'm using the below cqlsh and Cassandra versions:
> [cqlsh 5.0.1 | Cassandra 2.2.15 | CQL spec 3.3.1 | Native protocol v4]
>
> Thanks,
> Bogdan

-- 
Cédrick Lunven
EMEA Developer Advocate Manager
🎓 Free Trainings: DataStax Academy <https://academy.datastax.com/>
❓ Ask us your questions: DataStax Community <https://community.datastax.com/index.html>
🔬 Test our new products: DataStax Labs <https://downloads.datastax.com/#labs>