[jira] [Commented] (CASSANDRA-1278) Make bulk loading into Cassandra less crappy, more pluggable

Matthew F. Dennis (JIRA) Thu, 07 Apr 2011 14:29:45 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017157#comment-13017157
 ]


Matthew F. Dennis commented on CASSANDRA-1278:
----------------------------------------------

The attached v2 patch implements the requested proxy server. bin/proxyloader 
starts the proxy server, while proxy.conf, proxy-env.sh and log4j-proxy.conf 
control its configuration. Defaults are provided and described.

As requested, there is no CPT implementation in this patch; existing code using 
CPT should be able to use BareMemtableManager directly without too much trouble.

The proxy server accepts calls via Thrift RPC and eventually streams the 
results as an SSTable to C*, which then schedules a build of the indexes, bloom 
filters and secondary indexes.
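
To illustrate the buffer-then-stream flow described above, here is a minimal 
sketch in Python. It is not the code in the patch: SortedBatchBuffer, 
flush_threshold and the in-memory "flushed" list are hypothetical names, and 
sorting by raw key is a simplification (C* orders rows by partitioner token).

```python
class SortedBatchBuffer:
    """Hypothetical stand-in for the proxy's in-memory write buffer."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # rows held before a flush
        self.rows = {}                          # key -> {column: value}
        self.flushed = []                       # stands in for streamed SSTables

    def write(self, key, column, value):
        # Accept a write (in the real proxy this would arrive via Thrift RPC).
        self.rows.setdefault(key, {})[column] = value
        if len(self.rows) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Emit the buffered rows as one sorted batch, then start a new buffer.
        if not self.rows:
            return
        batch = [(key, sorted(cols.items()))
                 for key, cols in sorted(self.rows.items())]
        self.flushed.append(batch)
        self.rows = {}
```

The point of the sketch is only that the sorting work happens on the proxy, so 
the receiving node is left with building indexes and filters over data that is 
already ordered.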

In addition to the extensive suite of functional tests, testing was done across 
5 C* nodes at RF=3 and 3 proxy nodes, all running on EC2 XL instances. stress.py 
was run on the same nodes as the proxy and at 4 threads was easily able to 
saturate the CPU. As expected, the performance on each proxy is marginally 
higher than native C*. The important differences are:

    * a (much) lower number of threads is required to fully utilize all the 
available CPU (though adding more threads did not diminish the throughput).
    * the proxy does not return timeouts, but instead just slows input if it's 
completely overloaded.
    * the load moves from C* to the proxies as intended, so C* essentially 
requires only one core to build the indexes/filters once the data is streamed.
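
The "slows input instead of timing out" behavior in the list above is ordinary 
backpressure. A minimal sketch, assuming a bounded queue between the RPC 
handlers and the flusher (BoundedBuffer, capacity and run_demo are hypothetical 
names, not taken from the patch):

```python
import queue
import threading


class BoundedBuffer:
    """Hypothetical bounded buffer between writers and a single flusher."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def submit(self, row):
        # put() blocks while the queue is full, so an overloaded proxy
        # slows its callers down rather than answering with a timeout.
        self._q.put(row)

    def drain(self, sink):
        # Consume rows until the None sentinel arrives.
        while True:
            row = self._q.get()
            if row is None:
                break
            sink.append(row)


def run_demo(n_rows=100, capacity=4):
    buf = BoundedBuffer(capacity)
    sink = []
    consumer = threading.Thread(target=buf.drain, args=(sink,))
    consumer.start()
    for i in range(n_rows):
        buf.submit(i)      # blocks whenever the consumer falls behind
    buf.submit(None)       # sentinel: no more rows
    consumer.join()
    return sink
```

With a small capacity the producer is throttled to the consumer's pace, yet 
every row is still delivered in order, which is the trade the proxy makes 
compared to returning timeouts.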


> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.7.5
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 
> 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> It is just that this is something that many that are new to Cassandra need to 
> do - bulk load their data into Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
