[ 
https://issues.apache.org/jira/browse/CASSANDRA-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249059#comment-17249059
 ] 

Serban Teodorescu commented on CASSANDRA-12416:
-----------------------------------------------

I think this can be solved with a separate tool that merges multiple 
SSTables into one and then runs SSTableLoader on the result, something like 
https://github.com/tolbertam/sstable-tools#compact. It's debatable whether 
such a tool belongs in Cassandra, and if it does, it should get its own 
ticket anyway.

Theoretically, it would also be possible to merge the SSTables and stream the 
result directly instead of first writing it to disk as a new SSTable. But this 
would require refactoring SSTableLoader: as it stands, it relies on some table 
metadata to prepare the streaming, and that metadata won't be available until 
the merge is done.
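
To illustrate the merge-and-stream idea, here is a minimal Python sketch. It is a toy model, not Cassandra code: each "SSTable" is assumed to be an iterable of hypothetical (token, key, write_timestamp, value) tuples already sorted by (token, key), and duplicates are resolved by keeping the newest timestamp. Real SSTables also carry tombstones, TTLs and per-cell timestamps, which this ignores.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical row shape for illustration: (token, key, write_timestamp, value)
Row = Tuple[int, str, int, str]


def merge_sorted_runs(runs: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily merge runs that are each sorted by (token, key).

    When the same (token, key) appears in several runs, only the row
    with the highest write_timestamp is emitted.
    """
    merged = heapq.merge(*runs, key=lambda r: (r[0], r[1]))
    current = None
    for row in merged:
        if current is not None and (row[0], row[1]) == (current[0], current[1]):
            # Same partition as the pending row: keep the newer write.
            if row[2] > current[2]:
                current = row
            continue
        if current is not None:
            yield current
        current = row
    if current is not None:
        yield current
```

Because the merge is a generator, rows can be handed to a consumer as they are produced, without materialising the merged output. It also shows the problem described above: totals such as the row count or the covered token span are only known once the generator has been exhausted.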

Another point, partially related to this: in Cassandra 4 it is more 
efficient to stream SSTables that each belong to a single token range (see 
https://cassandra.apache.org/blog/2019/04/09/benchmarking_streaming.html). So a 
mix of merging and splitting by token range would be the most efficient (or you 
could implement the split at the source, in the code that uses 
CQLSSTableWriter).
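
As a rough sketch of what splitting at the source could look like, the Python below groups rows by token range before they are written. Everything here is an assumption for illustration: tokens are plain integers, `range_ends` is a hypothetical sorted list of inclusive upper bounds of the ring's ranges, and tokens beyond the last bound wrap around to the first range. In a real job the token would come from the cluster's partitioner and the ranges from the cluster's ring description.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape for illustration: (token, key)
Row = Tuple[int, str]


def split_by_token_range(rows: Iterable[Row],
                         range_ends: List[int]) -> Dict[int, List[Row]]:
    """Group rows by the index of the token range that owns them.

    range_ends must be sorted ascending. Range i covers tokens in
    (range_ends[i-1], range_ends[i]]; tokens greater than the last
    bound wrap around to range 0.
    """
    buckets = defaultdict(list)
    for row in rows:
        # First range whose inclusive upper bound is >= the token.
        idx = bisect.bisect_left(range_ends, row[0])
        if idx == len(range_ends):
            idx = 0  # wrap-around part of the ring
        buckets[idx].append(row)
    return dict(buckets)
```

Each bucket could then be written by its own writer instance, so that every resulting SSTable covers a single token range.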

 

> sstableloader to stream sstables in a sorted order
> --------------------------------------------------
>
>                 Key: CASSANDRA-12416
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12416
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Tools
>            Reporter: Zhaojun Zhang
>            Priority: Normal
>
> Within each sstable, the data is sorted. However, this is not true across 
> multiple sstables. We have a workflow that creates a read-only cluster 
> by bulk loading data from sstables (written by CQLSSTableWriter) into a 
> Cassandra cluster. We don't want to trigger compaction, and the best way to 
> avoid it is to write data in sorted order, which requires us to do a global 
> sort across all data sources using an external sort algorithm. If we were 
> able to use sstableloader to load data into clusters in order, we would not 
> need such a global sort, which would dramatically simplify our 
> implementation and reduce code redundancy.



