Hi Rüdiger,

I just saw this after I answered on the SO thread:
http://stackoverflow.com/questions/21778671/cassandra-how-to-insert-a-new-wide-row-with-good-performance-using-cql/21884943#21884943
On Wed, Feb 19, 2014 at 8:57 AM, John Sanda <john.sa...@gmail.com> wrote:
> From a quick glance at your code, it looks like you are preparing your
> insert statement multiple times. You only need to prepare it once. I would
> expect to see some improvement with that change.
>
>
> On Wed, Feb 19, 2014 at 5:27 AM, Rüdiger Klaehn <rkla...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am evaluating Cassandra for satellite telemetry storage and analysis. I
>> set up a small three-node cluster on my local development machine and
>> wrote a few simple test programs.
>>
>> My use case requires storing incoming telemetry updates in the database
>> at the same rate as they arrive. A telemetry update is a map of
>> name/value pairs that arrives at a certain time.
>>
>> The idea is that I want to store the data as quickly as possible, and
>> then later store it in an additional format that is more amenable to
>> analysis.
>>
>> The format I have chosen for my test is the following:
>>
>> CREATE TABLE IF NOT EXISTS test.wide (
>>   time varchar,
>>   name varchar,
>>   value varchar,
>>   PRIMARY KEY (time, name))
>> WITH COMPACT STORAGE;
>>
>> The layout I want to achieve with this is something like this:
>>
>> +-------+-------+-------+-------+-------+-------+
>> |       | name1 | name2 | name3 | ...   | nameN |
>> | time  +-------+-------+-------+-------+-------+
>> |       | val1  | val2  | val3  | ...   | valN  |
>> +-------+-------+-------+-------+-------+-------+
>>
>> (Time will at some point become some kind of timestamp, and value will
>> become a blob. But this is just for initial testing.)
>>
>> The problem is the following: I am getting very low performance for bulk
>> inserts into the above table. In my test program, each insert has a new,
>> unique time and creates a row with 10000 name/value pairs. This should
>> map to creating a single new row in the underlying storage engine,
>> correct? I do that 1000 times and measure both time per insert and total
>> time.
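One commonly suggested shape for the insert described above is to prepare the INSERT statement once per process and split each 10000-column wide-row write into bounded unlogged batches, rather than preparing per insert or batching all 10000 bindings at once. A minimal sketch of the chunking step in plain Java — the `PreparedStatement`/`session.execute` parts require a live cluster and are only indicated in comments, and the batch size of 100 is an illustrative assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split one wide-row insert (10000 name/value pairs) into bounded
// chunks. In driver code each inner list would become one UNLOGGED batch of
// statements bound from a single PreparedStatement, e.g.
//   "INSERT INTO test.wide (time, name, value) VALUES (?, ?, ?)"
// prepared once per process, not once per insert.
public class WideRowBatcher {
    // Each String[] is {time, name, value}; batchSize bounds the chunk size.
    public static List<List<String[]>> chunk(List<String[]> columns, int batchSize) {
        List<List<String[]>> batches = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batchSize) {
            int end = Math.min(i + batchSize, columns.size());
            batches.add(new ArrayList<>(columns.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String[]> cols = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            cols.add(new String[] { "t0", "name" + i, "value" + i });
        }
        List<List<String[]>> batches = chunk(cols, 100);
        System.out.println(batches.size()); // 10000 columns -> 100 batches
        // for (List<String[]> b : batches) { /* session.execute(buildBatch(b)); */ }
    }
}
```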
>>
>> I am getting about 0.5 s for each insert of 10000 name/value pairs, which
>> corresponds to a rate much lower than the one at which telemetry arrives
>> in my system. I have read a few previous threads on this subject and am
>> using batch prepared statements for maximum performance
>> (https://issues.apache.org/jira/browse/CASSANDRA-4693), but that does
>> not help.
>>
>> Here is the CQL benchmark:
>> https://gist.github.com/rklaehn/9089304#file-cassandratestminimized-scala
>>
>> I have written the exact same thing using the thrift API of astyanax, and
>> I am getting much better performance. Each insert of 10000 name/value
>> pairs takes 0.04 s using a ColumnListMutation. When I use async calls in
>> both programs, as suggested by somebody on Stack Overflow, the difference
>> gets even larger. The CQL insert remains at 0.5 s per insert on average,
>> whereas the astyanax ColumnListMutation approach takes 0.01 s per insert
>> on average, even on my test cluster. That's the kind of performance I
>> need.
>>
>> Here is the thrift benchmark, modified from an astyanax example:
>> https://gist.github.com/rklaehn/9089304#file-astclient-java
>>
>> I realize that running a test cluster on localhost is not a 100%
>> realistic test, but you would nevertheless expect both tests to have
>> roughly similar performance.
>>
>> I have seen a few suggestions to create a table with CQL and fill it
>> using the thrift API, for example in this thread:
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201309.mbox/%3c523334b8.8070...@gmail.com%3E
>> But I would very much prefer to use pure CQL for this. The thrift API
>> seems to be considered deprecated, so I would not feel comfortable
>> starting a new project on a legacy API.
>>
>> I already posted a question on SO about this, but did not get any
>> satisfactory answer, just general performance tuning tips that do
>> nothing to explain the difference between the CQL and thrift approaches.
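The async variant discussed above gains most of its speed from keeping many requests in flight at once while bounding concurrency so the cluster is not overwhelmed. A sketch of that throttling pattern, with a stand-in task where `session.executeAsync(batch)` would go in real driver code — the limit of 128 in-flight requests is an illustrative assumption, not a measured value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncPipeline {
    // Submit `tasks` work items, allowing at most `maxInFlight` to be
    // outstanding at once -- the same shape as firing executeAsync() per
    // batch and throttling with a semaphore instead of awaiting each result.
    public static int run(int tasks, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Semaphore inFlight = new Semaphore(maxInFlight);
        AtomicInteger done = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            inFlight.acquireUninterruptibly(); // block when too many outstanding
            CompletableFuture<Void> f = CompletableFuture.runAsync(
                done::incrementAndGet,         // stand-in for executeAsync(batch)
                pool);
            f.whenComplete((v, t) -> inFlight.release());
            futures.add(f);
        }
        // Wait for the tail of outstanding requests to complete.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 128)); // all 1000 tasks complete
    }
}
```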
>>
>> http://stackoverflow.com/questions/21778671/cassandra-how-to-insert-a-new-wide-row-with-good-performance-using-cql
>>
>> Am I doing something wrong, or is this a fundamental limitation of CQL?
>> If the latter is the case, what's the plan to mitigate the issue?
>>
>> There is a JIRA issue about this
>> (https://issues.apache.org/jira/browse/CASSANDRA-5959), but it is marked
>> as a duplicate of https://issues.apache.org/jira/browse/CASSANDRA-4693.
>> According to my benchmarks, however, batch prepared statements do not
>> solve this issue!
>>
>> I would really appreciate any help on this issue. The telemetry data I
>> would like to import into C* for testing contains ~2*10^12 samples,
>> where each sample consists of time, value and status. If quick batch
>> insertion is not possible, I would not even be able to insert it in an
>> acceptable time.
>>
>> best regards,
>>
>> Rüdiger
>
>
> --
> - John

--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
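For scale, a rough feasibility check of the ~2*10^12-sample load under the two per-insert times quoted in the thread (0.01 s vs 0.5 s per 10000-sample insert); this is pure arithmetic on the reported figures, not a benchmark:

```java
// Back-of-envelope: how long would loading 2e12 samples take at each rate?
public class LoadEstimate {
    public static double daysToLoad(double samples, double secondsPer10k) {
        double rate = 10000.0 / secondsPer10k; // samples per second
        return samples / rate / 86400.0;       // seconds -> days
    }

    public static void main(String[] args) {
        // thrift async (0.01 s / 10000 samples): about 23 days
        System.out.printf("thrift async: %.0f days%n", daysToLoad(2e12, 0.01));
        // CQL batch (0.5 s / 10000 samples): about 1157 days, i.e. over 3 years
        System.out.printf("CQL batch:    %.0f days%n", daysToLoad(2e12, 0.5));
    }
}
```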