[ https://issues.apache.org/jira/browse/CASSANDRA-10358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987272#comment-14987272 ]

Andre Turgeon commented on CASSANDRA-10358:
-------------------------------------------

[~slebresne], I was not aware that {{CQLSSTableWriter}} was intended solely as 
a way of generating data to be loaded via {{sstableloader}}. I was using it 
differently: I used rsync to copy the output files directly to their 
destination Cassandra nodes. Prior to CASSANDRA-7360 this worked, so from my 
perspective the change was a regression. Perhaps I should explain how I use 
{{CQLSSTableWriter}} a bit more clearly:
We have a Map/Reduce program (running on Hadoop) which reads terabytes of data 
and generates SSTables in parallel. Once generated, these SSTables are 
rsynced to their destination nodes. The generated SSTables are already at the 
appropriate level for leveled compaction, which saves a lot of compaction time. 
Because the Hadoop cluster is very large, it can crunch through the data much 
more quickly than the Cassandra cluster. At that point the only bottleneck is 
the transfer time. 
This saves a lot of time when we bulk-load data. Using this method, our dataset 
loads in about 3 hours. When I use {{sstableloader}}, again in parallel using 
Hadoop, it takes over a week for the load and compaction to finish.

> Allow CQLSSTableWriter.Builder to use custom AbstractSSTableSimpleWriter 
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10358
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10358
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Andre Turgeon
>            Priority: Minor
>         Attachments: SSTableWriterCreationStrategy.patch, patch.txt
>
>
> I've created a patch for your consideration. 
> This change to CQLSSTableWriter allows for a custom 
> AbstractSSTableSimpleWriter to be specified. 
> I needed this for a bulkload process I wrote. I believe the change would be 
> beneficial for other people as well. 
> Below are the reasons I needed a custom implementation of 
> AbstractSSTableSimpleWriter:
> 1) The available implementations of AbstractSSTableSimpleWriter do not 
> provide a way to specify the filename (or rather revision) of the sstable. I 
> needed to control the name because my bulkload process writes sstables in 
> parallel (on multiple machines) and I wish to avoid name collisions.
> 2) I discovered a problem with SSTableSimpleUnsortedWriter where it creates 
> invalid level-compaction-style sstables: it allows a partition to span 2 
> sstables, which violates the "no overlap of token ranges" constraint of level 
> compaction.
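
To make the constraint in point 2 concrete, here is a minimal, self-contained sketch of a "no overlap of token ranges" check. It is only an illustration and is not taken from the attached patches: the names {{LevelOverlapCheck}}, {{TokenRange}} and {{hasOverlap}} are hypothetical, and tokens are simplified to plain {{long}} values instead of Cassandra's own token types. A partition that spans 2 sstables shows up as two ranges sharing a boundary token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LevelOverlapCheck {

    /** Inclusive token range covered by one sstable (hypothetical type, assumes first <= last). */
    public static final class TokenRange {
        final long first;
        final long last;

        public TokenRange(long first, long last) {
            this.first = first;
            this.last = last;
        }
    }

    /**
     * Returns true if any two ranges share at least one token.
     * After sorting by first token it is enough to compare neighbours:
     * if no neighbouring pair overlaps, no other pair can overlap either.
     */
    public static boolean hasOverlap(List<TokenRange> ranges) {
        List<TokenRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((TokenRange r) -> r.first));
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).first <= sorted.get(i - 1).last) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The partition with token 100 was split across two sstables: invalid for one level.
        List<TokenRange> split = Arrays.asList(new TokenRange(0, 100), new TokenRange(100, 200));
        // Every partition lives in exactly one sstable: valid for one level.
        List<TokenRange> clean = Arrays.asList(new TokenRange(0, 99), new TokenRange(100, 200));

        System.out.println(hasOverlap(split)); // prints "true"
        System.out.println(hasOverlap(clean)); // prints "false"
    }
}
```

A check of this kind could be run over the first and last tokens of each generated sstable before the files are copied to a node, on the assumption that those boundaries are available to the bulkload job.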



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
