[jira] [Commented] (CASSANDRA-1278) Make bulk loading into Cassandra less crappy, more pluggable

Stu Hood (JIRA) Sat, 30 Apr 2011 13:53:46 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027374#comment-13027374 ]


Stu Hood commented on CASSANDRA-1278:
-------------------------------------

bq. Maybe we've been too clever here: why not just write out the full sstable 
on the client, and stream it over (indexes and all) so that
As much as I want to merge the protocols, I'm not sure I like the limitations 
this puts on clients: being able to send a stream without needing local 
tempspace is very, very beneficial, IMO (for example, needing tempspace was by 
far the most annoying limitation of a Hadoop LuceneOutputFormat I worked on).

bq. If you're comparing to the streams we use for repair and similar, they 
require table names and byte ranges be known up front
bq. something we can deal with at the StreamInSession level, I don't think 
we'll need to change the protocol itself
With versioned messaging, changing the protocol is at least possible, if 
painful... my _dream_ would be:
# Deprecate the file ranges in Streaming session objects, to be replaced with 
framing in the stream
# Move the Streaming session object to a header of the streaming connection 
(almost identical to LoaderStream)
# Deprecate the Messaging based setup and teardown for streaming sessions: a 
sender initiates a stream by opening a streaming connection, and tears it down 
with success codes after each file (again, like this protocol)
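
A minimal sketch of what those three points could look like on the wire, in Python for brevity. This is not Cassandra's streaming code or Matt's protocol: the frame layout (4-byte big-endian length prefix), the empty-frame end markers, the `SUCCESS` value, and every function name are assumptions made for illustration. The session header travels as the first frame of the connection, each file is sent as a run of frames straight from an iterator (so the sender needs no local tempspace), and the receiver produces one status code per file. An in-memory buffer stands in for the socket, so the statuses are returned rather than written back to the sender.

```python
import io
import struct

SUCCESS = 0  # hypothetical per-file status code


def write_frame(out, payload: bytes) -> None:
    """Write one frame: 4-byte big-endian length, then the payload."""
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)


def read_frame(inp) -> bytes:
    """Read one frame; raise EOFError if the stream ends mid-frame."""
    prefix = inp.read(4)
    if len(prefix) != 4:
        raise EOFError("truncated frame header")
    (length,) = struct.unpack(">I", prefix)
    payload = inp.read(length)
    if len(payload) != length:
        raise EOFError("truncated frame payload")
    return payload


def send_session(out, session_header: bytes, files) -> None:
    """Send a session header, then each file as frames.

    `files` is an iterable of (name, chunks) pairs with non-empty names.
    Chunks are consumed lazily, so nothing is staged on local disk.
    """
    write_frame(out, session_header)
    for name, chunks in files:
        write_frame(out, name.encode("utf-8"))
        for chunk in chunks:
            if chunk:  # an empty frame is reserved as the end-of-file marker
                write_frame(out, chunk)
        write_frame(out, b"")  # end of this file
    write_frame(out, b"")  # empty file name: end of session


def receive_session(inp):
    """Read a whole session; return (header, {name: bytes}, [(name, status)])."""
    session_header = read_frame(inp)
    received = {}
    statuses = []
    while True:
        name = read_frame(inp)
        if not name:  # end-of-session marker
            break
        parts = []
        while True:
            chunk = read_frame(inp)
            if not chunk:  # end-of-file marker
                break
            parts.append(chunk)
        key = name.decode("utf-8")
        received[key] = b"".join(parts)
        statuses.append((key, SUCCESS))  # one status code per file
    return session_header, received, statuses
```

Because the file boundaries are carried by the framing, the sender never has to know byte ranges up front, which is the property the file-range fields in the session object exist to provide today.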

----

tl;dr: I'd prefer some slight adjustments to Matt's protocol (mentioned above) 
over requiring tempspace on the client.

> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.8.1
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 
> 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685.
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> Bulk loading data is something that many people who are new to Cassandra 
> need to do.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
