Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-13 Thread David Tinker
I wrote some scripts to test this: https://github.com/davidtinker/cassandra-perf

3 node cluster, each node: Intel® Xeon® E3-1270 v3 Quadcore Haswell
32GB RAM, 1 x 2TB commit log disk, 2 x 4TB data disks (RAID0)

Using a batch of prepared statements is about 5% faster than inline parameters:

InsertBatchOfPreparedStatements: Inserted 2551704 rows in 100000
batches using 256 concurrent operations in 15.785 secs, 161653 rows/s,
6335 batches/s

InsertInlineBatch: Inserted 2551704 rows in 100000 batches using 256
concurrent operations in 16.712 secs, 152686 rows/s, 5983 batches/s
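The reported rows/s and batches/s figures are simply the totals divided by elapsed seconds. A quick sanity check (the helper names here are ours, not from the benchmark code):

```java
// Sanity-check the reported throughput: rows/s is total rows divided by
// elapsed seconds, truncated to a whole number the way the report prints it.
// Helper names are ours, not from the cassandra-perf benchmark code.
public class Throughput {
    static long rowsPerSec(long rows, double secs) {
        return (long) (rows / secs);
    }

    public static void main(String[] args) {
        // InsertBatchOfPreparedStatements: 2551704 rows in 15.785 s
        System.out.println(rowsPerSec(2551704, 15.785)); // 161653
        // InsertInlineBatch: 2551704 rows in 16.712 s
        System.out.println(rowsPerSec(2551704, 16.712)); // 152686
    }
}
```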


-- 
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration


Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Sylvain Lebresne
 This loop takes 2500ms or so on my test cluster:

 PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
 for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

 The same loop with the parameters inline is about 1300ms. It gets
 worse if there are many parameters.


Do you mean that:
  for (int i = 0; i < 1000; i++)
  session.execute("INSERT INTO perf_test.wibble (id, info) VALUES (" + i + ", aa" + i + ")");
is twice as fast as using a prepared statement? And that the difference
is even greater if you add more columns than id and info?

That would certainly be unexpected, are you sure you're not re-preparing the
statement every time in the loop?

--
Sylvain

I know I can use batching to
 insert all the rows at once but that's not the purpose of this test. I
 also tried using session.execute(cql, params) and it is faster but
 still doesn't match inline values.

 Composing CQL strings is certainly convenient and simple but is there
 a much faster way?

 Thanks
 David

 I have also posted this on Stackoverflow if anyone wants the points:

 http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application



Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Sylvain Lebresne
Then I suspect that this is an artifact of your test methodology. Prepared
statements *are* faster than non-prepared ones in general. They save some
parsing and some bytes on the wire. The savings tend to be bigger for
bigger queries, and it's possible that for very small queries (like the one
you are testing) the performance difference is somewhat negligible, but
seeing non-prepared statements being significantly faster than prepared ones
almost surely means you're doing something wrong (of course, a bug in either
the driver or C* is always possible, and always make sure to test recent
versions, but I'm not aware of any such bug).

Are you sure you are warming up the JVMs (client and driver) properly, for
instance? 1000 iterations is *really small*; if you're not warming things
up properly, you're not measuring anything relevant. Also, are you including
the preparation of the query itself in the timing? Preparing a query is not
particularly fast, but it's meant to be done just once at the beginning of
the application lifetime. With only 1000 iterations, if you include the
preparation in the timing, it's entirely possible it's eating a good chunk
of the whole time.

But prepared versus non-prepared aside, you won't get proper performance
unless you parallelize your inserts. Unlogged batches are one way to do it
(that's really all Cassandra does with an unlogged batch: parallelize). But
as John Sanda mentioned, another option is to do the parallelization client
side, with executeAsync.

--
Sylvain
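The client-side parallelization Sylvain describes can be sketched as below. The "session" is stubbed out as a counter here; a real version would instead call the DataStax driver's session.executeAsync with a bound PreparedStatement and cap the number of in-flight futures.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Client-side parallelization sketch. insertStub stands in for
// session.executeAsync(ps.bind(...)); with the real driver you would
// also bound how many futures are in flight at once.
public class ParallelInserts {
    static final AtomicInteger inserted = new AtomicInteger();

    static void insertStub(String id, String info) {
        inserted.incrementAndGet(); // stand-in for session.executeAsync(...)
    }

    static int run(int rows, int concurrency) throws Exception {
        inserted.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < rows; i++) {
            final int n = i;
            pool.submit(() -> insertStub("" + n, "aa" + n));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return inserted.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1000, 256)); // 1000
    }
}
```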



On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote:

 Yes, that's what I found.

 This is faster:

 for (int i = 0; i < 1000; i++) session.execute("INSERT INTO
 test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")

 Than this:

 def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
 for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))

 This is the fastest option of all (hand-rolled batch):

 StringBuilder b = new StringBuilder()
 b.append("BEGIN UNLOGGED BATCH\n")
 for (int i = 0; i < 1000; i++) {
     b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','")
         .append("aa").append(i).append("')\n")
 }
 b.append("APPLY BATCH\n")
 session.execute(b.toString())





Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Robert Wille
I use hand-rolled batches a lot. You can get a *lot* of performance
improvement. Just make sure to sanitize your strings.
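On the sanitizing point: when inlining text values into a CQL string, a single quote inside the value must be doubled, per CQL's string-literal rules. A minimal helper (the name is ours); note that bound prepared statements avoid this issue entirely:

```java
// Minimal sanitizer for inlining a text value into a CQL statement:
// CQL (like SQL) escapes a single quote inside a string literal by
// doubling it. Helper name is ours; prepared/bound statements make
// this escaping unnecessary.
public class Cql {
    static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        System.out.println(quote("O'Brien")); // 'O''Brien'
    }
}
```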

I've been wondering, what's the limit, practical or hard, on the length of
a query?

Robert





Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Robert Wille
Network latency is the reason why the batched query is fastest. One trip to
Cassandra versus 1000. If you execute the inserts in parallel, then that
eliminates the latency issue.
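The arithmetic behind this is simple: serial statement-at-a-time inserts pay one network round trip per row, while one batch pays a single round trip in total. The 0.5 ms RTT below is an assumed illustrative figure, not a measurement from this thread:

```java
// Back-of-the-envelope for the latency argument: serial execute() pays
// one round trip per row; a single batch pays one in total. The 0.5 ms
// RTT is an assumed figure for illustration only.
public class RoundTrips {
    static double serialMillis(int trips, double rttMs) {
        return trips * rttMs;
    }

    public static void main(String[] args) {
        System.out.println(serialMillis(1000, 0.5)); // 500.0 ms for 1000 trips
        System.out.println(serialMillis(1, 0.5));    // 0.5 ms for one batch
    }
}
```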






Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Sylvain Lebresne
On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote:

 Network latency is the reason why the batched query is fastest. One trip
 to Cassandra versus 1000. If you execute the inserts in parallel, then that
 eliminates the latency issue.


While it is true a batch means only one client-server round trip, I'll
note that, provided you use the TokenAware load balancing policy, doing the
parallelization client side will save you intra-replica round trips, which a
big batch won't. So it might not be all that clear which one is faster. And
very large batches have the disadvantage that you are more likely to get a
timeout (and if you do, you have to retry the whole batch, even though most
of it has probably been inserted correctly). Overall, the best option is
probably to parallelize the inserts of reasonably sized batches, but the
right size is likely very use-case dependent; you'll have to test.

--
Sylvain
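The "reasonably sized batches, parallelized" approach starts by splitting the rows into fixed-size chunks that can then be sent concurrently. A minimal sketch; the batch size of 64 is an arbitrary placeholder, since as noted above the right size is use-case dependent and has to be tested:

```java
import java.util.ArrayList;
import java.util.List;

// Split rows into fixed-size batches that could then be executed
// concurrently (each chunk becoming one unlogged batch). The size of
// 64 is an arbitrary placeholder to be tuned per use case.
public class Chunker {
    static <T> List<List<T>> chunks(List<T> rows, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            out.add(new ArrayList<>(rows.subList(i, Math.min(i + size, rows.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add(i);
        List<List<Integer>> batches = chunks(rows, 64);
        System.out.println(batches.size());                         // 16
        System.out.println(batches.get(batches.size() - 1).size()); // 40 (remainder)
    }
}
```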





Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread David Tinker
I didn't do any warming up etc. I am new to Cassandra and was just
poking around with some scripts to try to find the fastest way to do
things. That said, all the mini-tests ran under the same conditions.

In our case the batches will have a variable number of different
inserts/updates in them, so doing a whole batch as a PreparedStatement
won't help. However, using a BatchStatement and stuffing it full of
repeated PreparedStatements might be better than a batch with inlined
parameters. I will do a test of that and see. I will also let the VM
warm up and whatnot this time.
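For the hand-rolled-string variant of that idea, a batch can mix arbitrary inserts and updates. A sketch of such a builder (the class is ours, for illustration); the driver-level equivalent would be a BatchStatement filled with bound PreparedStatements:

```java
// Sketch of a hand-rolled unlogged batch mixing different statements
// (a variable mix of inserts and updates per batch). The builder class
// is ours; the driver-level equivalent is a BatchStatement filled with
// bound PreparedStatements. Values inlined here would need sanitizing.
public class MixedBatch {
    private final StringBuilder b = new StringBuilder("BEGIN UNLOGGED BATCH\n");

    MixedBatch add(String stmt) {
        b.append(stmt).append('\n');
        return this;
    }

    String cql() {
        return b.toString() + "APPLY BATCH\n";
    }

    public static void main(String[] args) {
        String cql = new MixedBatch()
            .add("INSERT INTO test.wibble (id, info) VALUES ('1', 'aa1')")
            .add("UPDATE test.wibble SET info = 'bb2' WHERE id = '2'")
            .cql();
        System.out.print(cql);
    }
}
```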




-- 
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration


Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-11 Thread Robert Wille
Very good point. I've written code to do a very large number of inserts, but
I've only ever run it on a single-node cluster. I may very well find out
when I run it against a multi-node cluster that the performance benefits of
large unlogged batches mostly go away.


What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-10 Thread David Tinker
I have tried the DataStax Java driver and it seems the fastest way to
insert data is to compose a CQL string with all parameters inline.

This loop takes 2500ms or so on my test cluster:

PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

The same loop with the parameters inline is about 1300ms. It gets
worse if there are many parameters. I know I can use batching to
insert all the rows at once but thats not the purpose of this test. I
also tried using session.execute(cql, params) and it is faster but
still doesn't match inline values.

Composing CQL strings is certainly convenient and simple but is there
a much faster way?

Thanks
David

I have also posted this on Stackoverflow if anyone wants the points:
 
http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application


Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-10 Thread graham sanderson
I should probably give you a number: about 300 MB/s via the thrift API,
using 1 MB batches.

On Dec 10, 2013, at 5:14 AM, graham sanderson gra...@vast.com wrote:

 Perhaps not the way forward, however I can bulk insert data via astyanax at a 
 rate that maxes out our (fast) networks. That said for our next release (of 
 this part of our product - our other current is node.js via binary protocol) 
 we will be looking at insert speed via java driver, and also alternative 
 scala/java implementations of the binary protocol.
 
 On Dec 10, 2013, at 4:49 AM, David Tinker david.tin...@gmail.com wrote:
 
 I have tried the DataStax Java driver and it seems the fastest way to
 insert data is to compose a CQL string with all parameters inline.
 
 This loop takes 2500ms or so on my test cluster:
 
 PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble
 (id, info) VALUES (?, ?)");
 for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
 
 The same loop with the parameters inline is about 1300ms. It gets
 worse if there are many parameters. I know I can use batching to
 insert all the rows at once but that's not the purpose of this test. I
 also tried using session.execute(cql, params) and it is faster but
 still doesn't match inline values.
 
 Composing CQL strings is certainly convenient and simple but is there
 a much faster way?
 
 Thanks
 David
 
 I have also posted this on Stackoverflow if anyone wants the points:
 http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
 





Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-10 Thread David Tinker
Hmm. I have read that the thrift interface to Cassandra is out of
favour and the CQL interface is in. Where does that leave Astyanax?

On Tue, Dec 10, 2013 at 1:14 PM, graham sanderson gra...@vast.com wrote:
 Perhaps not the way forward, however I can bulk insert data via astyanax at a 
 rate that maxes out our (fast) networks. That said for our next release (of 
 this part of our product - our other current is node.js via binary protocol) 
 we will be looking at insert speed via java driver, and also alternative 
 scala/java implementations of the binary protocol.

 On Dec 10, 2013, at 4:49 AM, David Tinker david.tin...@gmail.com wrote:

 I have tried the DataStax Java driver and it seems the fastest way to
 insert data is to compose a CQL string with all parameters inline.

 This loop takes 2500ms or so on my test cluster:

 PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble
 (id, info) VALUES (?, ?)");
 for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

 The same loop with the parameters inline is about 1300ms. It gets
 worse if there are many parameters. I know I can use batching to
 insert all the rows at once but that's not the purpose of this test. I
 also tried using session.execute(cql, params) and it is faster but
 still doesn't match inline values.

 Composing CQL strings is certainly convenient and simple but is there
 a much faster way?

 Thanks
 David

 I have also posted this on Stackoverflow if anyone wants the points:
 http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application




-- 
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration


Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-10 Thread graham sanderson
I can’t speak for Astyanax; their thrift transport I believe is abstracted out, 
however the object model is very CF wide row vs table-y.

I have no idea what the plans are for further Astyanax dev (maybe someone on 
this list does), but I believe the thrift API is not going away, so considering 
Astyanax/thrift is an option, though I'd imagine you wouldn't gain much going 
down the CQL-over-thrift route, so you need to be able to model your data in 
“internal” form.

Two reasons we may want to move to the binary protocol: 
for reads: asynchronous ability (which is now in thrift but it seems unlikely 
to be utilized in cassandra) 
for writes: compression, since we are (currently) network bandwidth limited for 
enormous batch inserts (from hadoop)

On Dec 10, 2013, at 6:44 AM, David Tinker david.tin...@gmail.com wrote:

 Hmm. I have read that the thrift interface to Cassandra is out of
 favour and the CQL interface is in. Where does that leave Astyanax?
 
 On Tue, Dec 10, 2013 at 1:14 PM, graham sanderson gra...@vast.com wrote:
 Perhaps not the way forward, however I can bulk insert data via astyanax at 
 a rate that maxes out our (fast) networks. That said for our next release 
 (of this part of our product - our other current is node.js via binary 
 protocol) we will be looking at insert speed via java driver, and also 
 alternative scala/java implementations of the binary protocol.
 
 On Dec 10, 2013, at 4:49 AM, David Tinker david.tin...@gmail.com wrote:
 
 I have tried the DataStax Java driver and it seems the fastest way to
 insert data is to compose a CQL string with all parameters inline.
 
 This loop takes 2500ms or so on my test cluster:
 
 PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble
 (id, info) VALUES (?, ?)");
 for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
 
 The same loop with the parameters inline is about 1300ms. It gets
 worse if there are many parameters. I know I can use batching to
 insert all the rows at once but that's not the purpose of this test. I
 also tried using session.execute(cql, params) and it is faster but
 still doesn't match inline values.
 
 Composing CQL strings is certainly convenient and simple but is there
 a much faster way?
 
 Thanks
 David
 
 I have also posted this on Stackoverflow if anyone wants the points:
 http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
 
 
 
 
 -- 
 http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration





Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-10 Thread John Sanda
session.execute blocks until C* returns the response. Use the async
version, but do so with caution. If you don't throttle the requests, you
will start seeing timeouts on the client side pretty quickly. For
throttling I've used a Semaphore, but I think Guava's RateLimiter is better
suited. And if you want to wait until all the writes have finished,
definitely use Guava's futures API. Try something like,

PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
// you will need to tune the rate to your environment
RateLimiter permits = RateLimiter.create(500);
int count = 1000;
final CountDownLatch latch = new CountDownLatch(count);
for (int i = 0; i < count; i++) {
    permits.acquire(); // throttle: blocks until the rate limiter allows another request
    ResultSetFuture future = session.executeAsync(ps.bind("" + i, "aa" + i));
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        public void onSuccess(ResultSet rows) {
            latch.countDown();
        }

        public void onFailure(Throwable t) {
            latch.countDown();
            // log the error or other error handling
        }
    });
}
latch.await();   // need to handle and/or throw InterruptedException
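
The Semaphore-based throttling mentioned above can be sketched without the driver. In this sketch a thread pool stands in for session.executeAsync and the class/method names are illustrative; the point is only the permit-per-in-flight-request pattern:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledWrites {

    // Caps the number of in-flight "inserts" at maxInFlight and returns how
    // many completed. The executor stands in for session.executeAsync.
    static int runThrottled(int tasks, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                permits.acquire();                   // blocks while maxInFlight ops are pending
                pool.execute(() -> {
                    try {
                        completed.incrementAndGet(); // stand-in for the actual insert
                    } finally {
                        permits.release();           // free a slot, as the future callback would
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runThrottled(1000, 8)); // prints 1000
    }
}
```

Unlike RateLimiter (which caps requests per second), a Semaphore caps concurrent in-flight requests, which maps more directly to "don't overrun the server's queue".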



On Tue, Dec 10, 2013 at 8:16 PM, graham sanderson gra...@vast.com wrote:

 I can’t speak for Astyanax; their thrift transport I believe is abstracted
 out, however the object model is very CF wide row vs table-y.

 I have no idea what the plans are for further Astyanax dev (maybe someone
 on this list), but I believe the thrift API is not going away, so
 considering Astyanax/thrift is an option, though I'd imagine you wouldn't
 gain much going down the CQL over thrift method, so you need to be able to
 model your data in “internal” form.

 Two reasons we may want to move to the binary protocol
 for reads: asynchronous ability (which is now in thrift but it seems
 unlikely to be utilized in cassandra)
 for writes: compression, since we are (currently) network bandwidth
 limited for enormous batch inserts (from hadoop)

 On Dec 10, 2013, at 6:44 AM, David Tinker david.tin...@gmail.com wrote:

  Hmm. I have read that the thrift interface to Cassandra is out of
  favour and the CQL interface is in. Where does that leave Astyanax?
 
  On Tue, Dec 10, 2013 at 1:14 PM, graham sanderson gra...@vast.com
 wrote:
  Perhaps not the way forward, however I can bulk insert data via
 astyanax at a rate that maxes out our (fast) networks. That said for our
 next release (of this part of our product - our other current is node.js
 via binary protocol) we will be looking at insert speed via java driver,
 and also alternative scala/java implementations of the binary protocol.
 
  On Dec 10, 2013, at 4:49 AM, David Tinker david.tin...@gmail.com
 wrote:
 
  I have tried the DataStax Java driver and it seems the fastest way to
  insert data is to compose a CQL string with all parameters inline.
 
  This loop takes 2500ms or so on my test cluster:
 
  PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble
  (id, info) VALUES (?, ?)");
  for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" +
 i));
 
  The same loop with the parameters inline is about 1300ms. It gets
  worse if there are many parameters. I know I can use batching to
  insert all the rows at once but that's not the purpose of this test. I
  also tried using session.execute(cql, params) and it is faster but
  still doesn't match inline values.
 
  Composing CQL strings is certainly convenient and simple but is there
  a much faster way?
 
  Thanks
  David
 
  I have also posted this on Stackoverflow if anyone wants the points:
 
 http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
 
 
 
 
  --
  http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ
 Integration




-- 

- John