[ 
https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027407#comment-13027407
 ] 

Stu Hood commented on CASSANDRA-1278:
-------------------------------------

bq. It's totally reasonable to require tempspace for bulkload in exchange for 
an extra 2x? performance win.
There are definitely ways we can get this performance back on the server side 
(in the future) without affecting clients. In particular, we could build the 
index behind the data as it arrives: the only blocker for doing this currently 
is that we need an estimated size to start building the bloom filter, but I see 
multiple ways around that (including partitioning the filter, which has other 
benefits: see CASSANDRA-2466).
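
To make the "partitioning the filter" idea concrete, here is a minimal sketch of one way to read it. It is not Cassandra's bloom filter code and not necessarily what CASSANDRA-2466 proposes. The class names, the partition capacity, and the bits-per-key and hash-count values are all assumptions chosen for illustration. The point is only that each partition has a fixed size, so no estimate of the total key count is needed before the first key arrives.

```python
import hashlib


class BloomPartition:
    """Fixed-size Bloom filter sized for a known, small number of keys."""

    def __init__(self, capacity, bits_per_key=10, hash_count=7):
        # bits_per_key and hash_count are illustrative defaults, not tuned values.
        self.capacity = capacity
        self.size = capacity * bits_per_key
        self.hash_count = hash_count
        self.bits = bytearray((self.size + 7) // 8)
        self.count = 0

    def _positions(self, key):
        # Double hashing: derive hash_count bit positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )

    def is_full(self):
        return self.count >= self.capacity


class PartitionedBloomFilter:
    """Grows by appending partitions, so the total key count need not be known."""

    def __init__(self, partition_capacity=1024):
        self.partition_capacity = partition_capacity
        self.partitions = [BloomPartition(partition_capacity)]

    def add(self, key):
        # Start a new partition once the current one has reached its capacity.
        if self.partitions[-1].is_full():
            self.partitions.append(BloomPartition(self.partition_capacity))
        self.partitions[-1].add(key)

    def __contains__(self, key):
        # A key may be in any partition, so every partition must be consulted.
        return any(key in partition for partition in self.partitions)
```

The trade-off in this sketch is that a lookup has to consult every partition, and the overall false-positive rate rises with the number of partitions, so the partition capacity would need to be chosen with that in mind.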

Additionally, our existing streaming protocol requires that a client be able to 
communicate out of band in our Messaging layer, where there be dragons. 
Honestly, I'd like to call Matt's protocol (plus framing and a version) 
"streaming v2".
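
As a rough illustration of what "plus framing and a version" could mean, the sketch below prefixes each payload with a one-byte version and a four-byte big-endian length. This layout is an assumption made up for this example; it is not the wire format of the attached patches or of the existing streaming protocol, and the names PROTOCOL_VERSION, write_frame, read_frame and frames_from_bytes are hypothetical.

```python
import io
import struct

PROTOCOL_VERSION = 2  # hypothetical version number for this sketch
HEADER = struct.Struct(">BI")  # 1-byte version, 4-byte big-endian payload length


def write_frame(stream, payload, version=PROTOCOL_VERSION):
    """Write one frame: header followed by the raw payload bytes."""
    stream.write(HEADER.pack(version, len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one frame from a buffered stream; return None on a clean end of stream."""
    header = stream.read(HEADER.size)
    if not header:
        return None
    if len(header) < HEADER.size:
        raise EOFError("truncated frame header")
    version, length = HEADER.unpack(header)
    if version != PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version: %d" % version)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated frame payload")
    return payload


def frames_from_bytes(data):
    """Decode every frame contained in an in-memory byte string."""
    stream = io.BytesIO(data)
    frames = []
    while True:
        payload = read_frame(stream)
        if payload is None:
            return frames
        frames.append(payload)
```

With a version in every header, a server can reject or adapt to an older client on the same connection, without any out-of-band exchange through the Messaging layer.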

But if you feel strongly about it, then by all means... I'm not trying to block 
progress here.

> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.8.1
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 
> 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685.
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> Bulk loading data is something that many people who are new to Cassandra 
> need to do.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
