[ 
https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990183#comment-12990183
 ] 

Jonathan Ellis commented on CASSANDRA-1278:
-------------------------------------------

Thinking about this some more, I think we can really simplify it from a client 
perspective.

We could implement the Thrift Cassandra interface (the interface implemented by 
CassandraServer), but instead of applying writes directly we would turn them 
into sorted, serialized byte streams (by using Memtable + sort) and stream 
those to the cluster.  We would keep one Memtable per replica range, so the 
receiving Cassandra node doesn't need to deserialize the data to decide whether 
to forward it.  (Obviously we would not support any read operations.)

Then there is _zero_ need for new work on the client side -- you can use 
Hector, Pycassa, Aquiles, whatever.

Well, almost zero: we'd need a batch_complete call to say "we're done, now 
build secondary indexes."  (The per-sstable bloom filter and primary index can 
be built in parallel with the load, the way StreamIn currently does.)
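
To make the idea concrete, here is a rough sketch of such a write-only 
front end.  Everything in it is an assumption made for illustration: the class 
name BulkLoadSketch, the integer tokens in [0, 100), the hashCode-based 
stand-in for a partitioner, and the writeUTF row encoding are all hypothetical 
and are not Cassandra's API or on-disk format.  Rows are also sorted by raw 
key here, where a real Memtable would sort by its own key order.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names exist in Cassandra.
final class BulkLoadSketch {
    // One sorted buffer ("memtable") per replica range, keyed by the range's
    // inclusive upper token bound. Tokens are assumed to be ints in [0, 100).
    private final TreeMap<Integer, TreeMap<String, String>> buffers = new TreeMap<>();
    private boolean complete = false;

    BulkLoadSketch(int... rangeUpperBounds) {
        for (int bound : rangeUpperBounds) {
            buffers.put(bound, new TreeMap<>());
        }
    }

    // Stand-in for a partitioner; an assumption for illustration only.
    static int token(String key) {
        return Math.floorMod(key.hashCode(), 100);
    }

    // Write path: buffer the row in the memtable that owns its token.
    void insert(String key, String value) {
        if (complete) {
            throw new IllegalStateException("batch already completed");
        }
        Map.Entry<Integer, TreeMap<String, String>> range = buffers.ceilingEntry(token(key));
        if (range == null) {
            range = buffers.firstEntry(); // wrap around the ring
        }
        range.getValue().put(key, value);
    }

    // Read path: deliberately unsupported, as in the proposal.
    String get(String key) {
        throw new UnsupportedOperationException("bulk loader does not support reads");
    }

    // Serialize each range's rows in sorted order into one byte stream per
    // range. A real node would build secondary indexes after receiving these.
    Map<Integer, byte[]> batchComplete() {
        Map<Integer, byte[]> streams = new LinkedHashMap<>();
        for (Map.Entry<Integer, TreeMap<String, String>> range : buffers.entrySet()) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                for (Map.Entry<String, String> row : range.getValue().entrySet()) {
                    out.writeUTF(row.getKey());
                    out.writeUTF(row.getValue());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            streams.put(range.getKey(), bytes.toByteArray());
        }
        complete = true;
        return streams;
    }
}
```

A client would call insert() as often as it likes and then batchComplete() 
once.  Because each byte stream already covers exactly one replica range, the 
receiving side can pass it along without looking inside it.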

Again, we could probably update the StreamIn/StreamOut interface to handle 
this.  It _may_ be simpler to create a new API, but my guess is not.

> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.7.2
>
>         Attachments: 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685.
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> This matters because bulk loading their data is something many people who are 
> new to Cassandra need to do.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
