Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I wrote some scripts to test this: https://github.com/davidtinker/cassandra-perf

3 node cluster, each node: Intel® Xeon® E3-1270 v3 quad-core Haswell, 32GB RAM, 1 x 2TB commit log disk, 2 x 4TB data disks (RAID0).

Using a batch of prepared statements is about 5% faster than inline parameters:

InsertBatchOfPreparedStatements: Inserted 2551704 rows in 10 batches using 256 concurrent operations in 15.785 secs, 161653 rows/s, 6335 batches/s

InsertInlineBatch: Inserted 2551704 rows in 10 batches using 256 concurrent operations in 16.712 secs, 152686 rows/s, 5983 batches/s
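The runs above issue 256 concurrent operations. As a rough, driver-agnostic sketch of how client code can cap the number of in-flight asynchronous inserts (the executeAsync approach discussed elsewhere in this thread), something like the following works. BoundedAsyncInserter and the task-factory shape are illustrative names, not driver API; in real use the factory would wrap a call like session.executeAsync(statement).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bounded-concurrency driver for asynchronous inserts: at most
// maxInFlight tasks are outstanding at once, which is the usual way
// to keep a flood of async inserts from overwhelming client or server.
public class BoundedAsyncInserter {

    public static void runAll(int total, int maxInFlight,
                              Supplier<CompletableFuture<Void>> taskFactory)
            throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        CountDownLatch done = new CountDownLatch(total);
        for (int i = 0; i < total; i++) {
            permits.acquire();                   // blocks once maxInFlight are outstanding
            taskFactory.get().whenComplete((v, t) -> {
                permits.release();               // free a slot for the next insert
                done.countDown();
            });
        }
        done.await();                            // wait for every insert to complete
    }
}
```

With a real driver session, the supplier would build and submit one statement per call; the semaphore is the only moving part that matters here.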
On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote:
> [snip]
--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
> This loop takes 2500ms or so on my test cluster:
>
>     PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
>     for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
>
> The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters.

Do you mean that:

    for (int i = 0; i < 1000; i++)
        session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");

is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected. Are you sure you're not re-preparing the statement every time in the loop?

--
Sylvain

> [snip]
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire. The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug).

Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant.

Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. With only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time.

But beyond prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts. Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync.

--
Sylvain

On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote:
> Yes that's what I found.
> This is faster:
>
>     for (int i = 0; i < 1000; i++)
>         session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")
>
> Than this:
>
>     def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
>     for (int i = 0; i < 1000; i++)
>         session.execute(ps.bind(["" + i, "aa" + i] as Object[]))
>
> This is the fastest option of all (hand-rolled batch):
>
>     StringBuilder b = new StringBuilder()
>     b.append("BEGIN UNLOGGED BATCH\n")
>     for (int i = 0; i < 1000; i++) {
>         b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','")
>          .append("aa").append(i).append("')\n")
>     }
>     b.append("APPLY BATCH\n")
>     session.execute(b.toString())
>
> On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote:
> > [snip]
>
> --
> http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
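Sylvain's two caveats above (warm up before timing, keep one-time setup like session.prepare out of the timed region) can be sketched as a tiny harness. WarmedTimer is an illustrative name; real measurements are better served by a dedicated benchmarking harness, but the shape is the point.

```java
import java.util.function.IntConsumer;

// Toy micro-benchmark harness: run warm-up iterations first (JIT,
// connection pools), then time only the measured iterations. Any
// one-time setup (e.g. preparing a statement) belongs before time().
public class WarmedTimer {

    // Returns elapsed nanoseconds for the measured portion only.
    public static long time(int warmup, int measured, IntConsumer op) {
        for (int i = 0; i < warmup; i++) op.accept(i);   // not timed
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) op.accept(i);
        return System.nanoTime() - start;
    }
}
```

With only 1000 iterations and no warm-up, as Sylvain notes, the JIT and the statement preparation can easily dominate the number you record.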
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I use hand-rolled batches a lot. You can get a *lot* of performance improvement. Just make sure to sanitize your strings.

I've been wondering, what's the limit, practical or hard, on the length of a query?

Robert

On 12/11/13, 3:37 AM, David Tinker david.tin...@gmail.com wrote:
> [snip]
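On the sanitizing point: a CQL string literal escapes a single quote by doubling it, so a hand-rolled batch builder should at minimum apply that transformation to every interpolated value. A rough sketch, with CqlBatchBuilder as an illustrative name and the two-column layout mirroring the thread's wibble example:

```java
// Hand-rolled unlogged batch builder with minimal string sanitizing.
// This only handles quote-doubling inside string literals; prepared
// statements avoid the problem entirely and are the safer default.
public class CqlBatchBuilder {

    // Escape a value for inclusion in a CQL single-quoted string literal.
    public static String escape(String s) {
        return s.replace("'", "''");
    }

    public static String unloggedBatch(String table, String[][] rows) {
        StringBuilder b = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (String[] row : rows) {
            b.append("INSERT INTO ").append(table)
             .append(" (id, info) VALUES ('").append(escape(row[0]))
             .append("', '").append(escape(row[1])).append("');\n");
        }
        return b.append("APPLY BATCH;").toString();
    }
}
```

Note the escaping covers only quote characters, not e.g. length limits; the practical cap on query length Robert asks about is a separate concern.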
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Network latency is the reason why the batched query is fastest: one trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

From: Sylvain Lebresne sylv...@datastax.com
Date: Wednesday, December 11, 2013 at 5:40 AM
Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application?
> [snip]
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote:
> Network latency is the reason why the batched query is fastest. One trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

While it is true a batch means only one client-server round trip, I'll note that provided you use the TokenAware load balancing policy, doing the parallelization client side will save you intra-replica round-trips, which using a big batch won't. So it might not be all that clear which one is faster. And very large batches have the disadvantage that you are more likely to get a timeout (and if you do, you have to retry the whole batch, even though most of it has probably been inserted correctly).

Overall, the best option is probably to parallelize inserts of reasonably sized batches, but the right batch size is likely very use-case dependent; you'll have to test.

--
Sylvain
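Sylvain's "reasonably sized batches" suggestion amounts to splitting the row stream into fixed-size chunks and submitting each chunk as its own (unlogged) batch in parallel. A minimal chunking sketch; BatchChunker is an illustrative name, and the right chunk size is the tunable he says has to be measured per use case:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a row set into fixed-size chunks so that each chunk can be
// turned into one unlogged batch (or one group of async statements).
public class BatchChunker {

    public static <T> List<List<T>> chunk(List<T> rows, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            // subList is a view; copy it if the source list will change
            out.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return out;
    }
}
```

Each chunk then stays small enough to retry cheaply on timeout, which addresses the big-batch downside Sylvain raises.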
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I didn't do any warming up etc. I am new to Cassandra and was just poking around with some scripts to try to find the fastest way to do things. That said, all the mini-tests ran under the same conditions.

In our case the batches will have a variable number of different inserts/updates in them, so doing a whole batch as a PreparedStatement won't help. However, using a BatchStatement and stuffing it full of repeated PreparedStatements might be better than a batch with inlined parameters. I will do a test of that and see. I will also let the VM warm up and whatnot this time.

On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote:
> [snip]
--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Very good point. I've written code to do a very large number of inserts, but I've only ever run it on a single-node cluster. I may very well find out when I run it against a multinode cluster that the performance benefits of large unlogged batches mostly go away.

From: Sylvain Lebresne sylv...@datastax.com
Date: Wednesday, December 11, 2013 at 6:52 AM
Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application?
> [snip]
What is the fastest way to get data into Cassandra 2 from a Java application?
I have tried the DataStax Java driver and it seems the fastest way to insert data is to compose a CQL string with all parameters inline.

This loop takes 2500ms or so on my test cluster:

    PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
    for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters.

I know I can use batching to insert all the rows at once but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster but still doesn't match inline values.

Composing CQL strings is certainly convenient and simple but is there a much faster way?

Thanks
David

I have also posted this on Stackoverflow if anyone wants the points:
http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I should probably give you a number: about 300 MB/s via the thrift api, using 1MB batches.

On Dec 10, 2013, at 5:14 AM, graham sanderson gra...@vast.com wrote:
> Perhaps not the way forward, however I can bulk insert data via astyanax at a rate that maxes out our (fast) networks. That said, for our next release (of this part of our product - our other current is node.js via the binary protocol) we will be looking at insert speed via the java driver, and also at alternative scala/java implementations of the binary protocol.
>
> On Dec 10, 2013, at 4:49 AM, David Tinker david.tin...@gmail.com wrote:
> > [snip]
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Hmm. I have read that the Thrift interface to Cassandra is out of favour and the CQL interface is in. Where does that leave Astyanax?

--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I can’t speak for Astyanax; I believe their Thrift transport is abstracted out, but the object model is very much CF/wide-row rather than table-oriented. I have no idea what the plans are for further Astyanax development (maybe someone on this list does), but I believe the Thrift API is not going away, so Astyanax/Thrift remains an option, though I’d imagine you wouldn’t gain much going down the CQL-over-Thrift route, and you need to be able to model your data in its “internal” form.

Two reasons we may want to move to the binary protocol:
- for reads: asynchronous ability (which is now in Thrift, but seems unlikely to be utilized in Cassandra)
- for writes: compression, since we are (currently) network-bandwidth limited for enormous batch inserts (from Hadoop)
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
session.execute() blocks until Cassandra returns the response. Use the async version, but do so with caution: if you don't throttle the requests, you will start seeing timeouts on the client side pretty quickly. For throttling I've used a Semaphore, but I think Guava's RateLimiter is better suited. And if you want to wait until all the writes have finished, definitely use Guava's futures API. Try something like:

    PreparedStatement ps = session.prepare(
        "INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
    RateLimiter permits = RateLimiter.create(500); // you will need to tune this to your environment
    int count = 1000;
    final CountDownLatch latch = new CountDownLatch(count);
    for (int i = 0; i < count; i++) {
        permits.acquire(); // blocks until the rate limiter allows another request
        ResultSetFuture future = session.executeAsync(ps.bind("" + i, "aa" + i));
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet rows) {
                latch.countDown();
            }
            public void onFailure(Throwable t) {
                latch.countDown(); // log the error or other error handling
            }
        });
    }
    latch.await(); // need to handle and/or throw InterruptedException

--
- John
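The Semaphore variant mentioned above can be sketched without a cluster. In this simulation the executor task stands in for session.executeAsync() and the class name and limits are illustrative; the property it demonstrates is that no more than a fixed number of writes are ever outstanding, while the latch still waits for all of them to complete:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledInserts {

    // Submits totalOps simulated async writes, never allowing more than
    // maxInFlight to be outstanding at once; returns the observed peak.
    static int run(int totalOps, int maxInFlight) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32); // stands in for the driver's I/O threads
        Semaphore permits = new Semaphore(maxInFlight);
        CountDownLatch latch = new CountDownLatch(totalOps);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < totalOps; i++) {
            permits.acquire();                          // block until a slot frees up
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            pool.submit(() -> {                         // stands in for session.executeAsync(...)
                try {
                    Thread.sleep(1);                    // simulated network round trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    permits.release();                  // free the slot on success OR failure
                    latch.countDown();
                }
            });
        }
        latch.await();                                  // wait for every write to finish
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak in flight: " + run(1000, 100));
    }
}
```

Releasing the permit in the finally block is the important detail: if it only happened on success, a burst of failures would leak permits and eventually deadlock the submitter. The real driver callback needs the same discipline in onFailure.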