[jira] [Commented] (CASSANDRA-1278) Make bulk loading into Cassandra less crappy, more pluggable

Jonathan Ellis (JIRA) Tue, 03 May 2011 20:36:48 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028574#comment-13028574
 ]


Jonathan Ellis commented on CASSANDRA-1278:
-------------------------------------------

bq. One of the main goals of the bulk loading was that no local/temp storage 
was required on the client; that has been the plan from the beginning

No, it hasn't.

But we can leave that aside for now; we already have "build everything else 
from the sstable bits" code, so we can add "take advantage of local storage to 
offload that from the server" later as an optimization.

bq. deprecate sessions altogether

You're going to need some kind of "when all of this is done, run this 
callback" construct for bootstrap/node movement. Currently we call that a 
Session.
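
To make the "run this callback when everything is done" idea concrete, here is 
a minimal sketch. The class name TransferSession and its methods are 
hypothetical and are not taken from the Cassandra code base; the sketch only 
assumes that the number of files in the transfer is known up front.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a session: counts outstanding files, then fires a callback. */
public class TransferSession {
    private final AtomicInteger pending;
    private final Runnable onComplete;

    public TransferSession(int fileCount, Runnable onComplete) {
        if (fileCount <= 0) {
            throw new IllegalArgumentException("fileCount must be positive");
        }
        this.pending = new AtomicInteger(fileCount);
        this.onComplete = onComplete;
    }

    /** Called once per finished file; the callback runs when the last file finishes. */
    public void fileCompleted() {
        int left = pending.decrementAndGet();
        if (left == 0) {
            onComplete.run();
        } else if (left < 0) {
            throw new IllegalStateException("more completions than files in the session");
        }
    }

    /** Number of files that have not completed yet. */
    public int remaining() {
        return Math.max(0, pending.get());
    }
}
```

Bootstrap or node movement would pass its "finish up" step as the callback, 
whatever the construct ends up being called.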

bq. When node A wants to send things to node B, it records that fact in the 
system table. For each entry it sends the file using the bulk loading protocol 
and continues retrying until the file is accepted.

Sounds exactly like what existing streaming does.
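
For reference, the record-then-retry loop described in the quote can be 
sketched roughly as below. RetryingSender, sendAll and the trySend predicate 
are hypothetical names used only for illustration; this is not the existing 
streaming code, and the attempt cap is an assumption added so the sketch 
terminates.

```java
import java.util.function.Predicate;

/** Hypothetical sketch: keep sending each recorded file until the receiver accepts it. */
public class RetryingSender {
    /**
     * @param pending            entries recorded as "to be sent" (e.g. read back from a table)
     * @param trySend            returns true when the receiver accepted the file
     * @param maxAttemptsPerFile assumed cap so the loop cannot spin forever
     * @return total number of send attempts made
     */
    public static <F> int sendAll(Iterable<F> pending, Predicate<F> trySend, int maxAttemptsPerFile) {
        int attempts = 0;
        for (F file : pending) {
            boolean accepted = false;
            for (int i = 0; i < maxAttemptsPerFile && !accepted; i++) {
                attempts++;
                accepted = trySend.test(file);
            }
            if (!accepted) {
                throw new IllegalStateException("gave up sending " + file);
            }
        }
        return attempts;
    }
}
```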

bq. The only complex part is preventing removal of the SSTable on the source

Currently we do this by simply maintaining a reference to the SSTR object so 
the GC doesn't delete it. There's no need to make it more complicated than that.
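
A minimal sketch of that approach, assuming file cleanup is tied to the reader 
object becoming unreachable: anything held in a strongly referenced set cannot 
be collected, so its file stays on disk until the transfer releases it. 
PendingTransfers and its methods are hypothetical names, not the actual 
streaming classes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: pin objects by holding strong references while a transfer runs. */
public class PendingTransfers<T> {
    // Strong references: anything in this set stays reachable, so the GC leaves it alone.
    private final Set<T> pinned = ConcurrentHashMap.newKeySet();

    /** Call before starting to send the sstable. */
    public void pin(T sstable) {
        pinned.add(sstable);
    }

    /** Call once the receiver has confirmed the file; cleanup may happen after this. */
    public void release(T sstable) {
        pinned.remove(sstable);
    }

    public boolean isPinned(T sstable) {
        return pinned.contains(sstable);
    }
}
```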

I took a look at the patch.  Just superficially, there's a lot of gratuitous 
change in there, e.g., refactoring test_thrift_server.py.  Those changes also 
need to be moved to a separate patch (again, I suggest git) so reviewers can 
easily distinguish refactoring from ticket-specific changes.

> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.8.1
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 
> 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> Bulk loading data is something that many people who are new to Cassandra 
> need to do.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
