Re: 2 nodes cassandra cluster raid10 or JBOD
Hi, What about using JBOD and replication factor 2? Regards. On 11 Dec 2013 02:03, cem cayiro...@gmail.com wrote: Hi all, I need to set up a 2-node Cassandra cluster. I know that Datastax recommends using JBOD as the disk configuration and relying on replication for redundancy. I was planning to use RAID 10, but JBOD would save 50% of the disk space and increase performance. However, I am not sure I should use JBOD with a 2-node cluster, since there is a higher chance of losing 50% of the cluster compared to a larger cluster. I may prefer to have stronger nodes if I have a limited number of nodes. What do you think about that? Is there anyone who runs a 2-node cluster? Best Regards, Cem
Cyclop - CQL3 web based editor
Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send an email to this group. It is a web-based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web, without needing to install anything. The editor supports code completion based not only on CQL syntax but also on database content - so, for example, a SELECT statement will suggest tables from the active keyspace, and the WHERE clause only columns from the table given after SELECT ... FROM. The results are displayed in a transposed table - rows horizontally and columns vertically - which seems more natural for a column-oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on Wicket + Bootstrap + Spring and can be deployed in any Servlet 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej
Re: Cyclop - CQL3 web based editor
Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: [...] -- Thanks, Murali
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
This loop takes 2500ms or so on my test cluster:

PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters.

Do you mean that:

for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");

is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain

I know I can use batching to insert all the rows at once but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster but still doesn't match inline values. Composing CQL strings is certainly convenient and simple but is there a much faster way? Thanks, David

I have also posted this on Stackoverflow if anyone wants the points: http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
Re: nodetool repair keeping an empty cluster busy
Sven, So basically when you run a repair you are essentially telling your cluster to run a validation compaction, which generates a Merkle tree on all the nodes. These trees are used to identify the inconsistencies, so there is quite a bit of streaming, which you see as your network traffic. Rahul

On Wed, Dec 11, 2013 at 11:02 AM, Sven Stark sven.st...@m-square.com.au wrote: Corollary: what is getting shipped over the wire? The ganglia screenshot shows the network traffic on all three hosts on which I ran the nodetool repair. Remember:

UN 10.1.2.11 107.47 KB 256 32.9% 1f800723-10e4-4dcd-841f-73709a81d432 rack1
UN 10.1.2.10 127.67 KB 256 32.4% bd6b2059-e9dc-4b01-95ab-d7c4fc0ec639 rack1
UN 10.1.2.12 107.62 KB 256 34.7% 5258f178-b20e-408f-a7bf-b6da2903e026 rack1

Much appreciated. Sven

On Wed, Dec 11, 2013 at 3:56 PM, Sven Stark sven.st...@m-square.com.au wrote: Howdy! Not a matter of life or death, just curious. I've just stood up a three-node cluster (v1.2.8) on three c3.2xlarge boxes in AWS. Silly me forgot the correct replication factor for one of the needed keyspaces, so I changed it via cli and ran a nodetool repair. Well... there is no data at all in the keyspace yet, only the definition, and nodetool repair ran about 20 minutes using 2 of the 8 CPUs fully. Any hints what nodetool repair is doing on an empty cluster that makes the host spin so hard?
Cheers, Sven

==
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 22.7%us, 1.0%sy, 2.9%ni, 73.0%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 15339196k total, 7474360k used, 7864836k free, 251904k buffers
Swap: 0k total, 0k used, 0k free, 798324k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
10840 cassandr 20  0 8354m 4.1g  19m S  218 28.0 35:25.73 jsvc
16675 kafka    20  0 3987m 192m  12m S    2  1.3  0:47.89 java
20328 root     20  0 5613m 569m  16m S    2  3.8  1:35.13 jsvc
 5969 exhibito 20  0 6423m 116m  12m S    1  0.8  0:25.87 java
14436 tomcat7  20  0 3701m 167m  11m S    1  1.1  0:25.80 java
 6278 exhibito 20  0 6487m 119m 9984 S    0  0.8  0:22.63 java
17713 storm    20  0 6033m 159m  11m S    0  1.1  0:10.99 java
18769 storm    20  0 5773m 156m  11m S    0  1.0  0:10.71 java

root@xxx-01:~# nodetool -h `hostname` status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address   Load      Tokens Owns  Host ID                              Rack
UN 10.1.2.11 107.47 KB 256    32.9% 1f800723-10e4-4dcd-841f-73709a81d432 rack1
UN 10.1.2.10 127.67 KB 256    32.4% bd6b2059-e9dc-4b01-95ab-d7c4fc0ec639 rack1
UN 10.1.2.12 107.62 KB 256    34.7% 5258f178-b20e-408f-a7bf-b6da2903e026 rack1

root@xxx-01:~# nodetool -h `hostname` compactionstats
pending tasks: 1
compaction type  keyspace  column family  completed  total  unit  progress
Active compaction remaining time: n/a

root@xxx-01:~# nodetool -h `hostname` netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name  Active  Pending  Completed
Commands   n/a     0        57155
Responses  n/a     0        14573
Re: nodetool repair keeping an empty cluster busy
Hi Rahul, thanks for replying. Could you please be a bit more specific, though? E.g. what exactly is being compacted - there is/was no data at all in the cluster save for a few hundred kB in the system CF (see the nodetool status output). Or: how can those few hundred kB of data generate GB of network traffic? Cheers, Sven

On Wed, Dec 11, 2013 at 7:56 PM, Rahul Menon ra...@apigee.com wrote: [...]
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire. The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. With only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts. Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain

On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes, that's what I found.
This is faster:

for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")

Than this:

def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))

This is the fastest option of all (hand-rolled batch):

StringBuilder b = new StringBuilder()
b.append("BEGIN UNLOGGED BATCH\n")
for (int i = 0; i < 1000; i++) {
    b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','")
        .append("aa").append(i).append("')\n")
}
b.append("APPLY BATCH\n")
session.execute(b.toString())

On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: [...] -- http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I use hand-rolled batches a lot. You can get a *lot* of performance improvement. Just make sure to sanitize your strings. I've been wondering: what's the limit, practical or hard, on the length of a query? Robert

On 12/11/13, 3:37 AM, David Tinker david.tin...@gmail.com wrote: [...]
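[Editor's note: Robert's "sanitize your strings" point matters because the inlined-values approach above concatenates user data straight into CQL. In CQL string literals a single quote is escaped by doubling it. A minimal sketch; the `escape` helper name is hypothetical, and this is illustration, not a complete injection defense:]

```java
public class CqlEscape {
    // CQL escapes a single quote inside a string literal by doubling it:
    // 'O''Brien' is the literal for O'Brien.
    public static String escape(String s) {
        return s.replace("'", "''");
    }

    public static void main(String[] args) {
        String name = "O'Brien";
        // Building an inline INSERT with the value escaped first:
        String cql = "INSERT INTO test.wibble (id, info) VALUES ('1', '" + escape(name) + "')";
        System.out.println(cql);
    }
}
```

Prepared statements avoid this problem entirely, since bound values are never spliced into the query text.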
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Network latency is the reason why the batched query is fastest: one trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

From: Sylvain Lebresne sylv...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, December 11, 2013 at 5:40 AM
To: user@cassandra.apache.org
Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application?

[...]
Re: Try to configure commitlog_archiving.properties
Bonnet Jonathan jonathan.bonnet at externe.bnpparibas.com writes: Thanks a lot, it works, I see commit logs being archived. I'll try the restore command tomorrow. Thanks again. Bonnet Jonathan.

Hello, I restarted a node today, and I get an error which seems to be related to commit log archiving:

ERROR 14:39:00,435 Exception encountered during startup
java.lang.RuntimeException: java.io.IOException: Cannot run program : error=2, No such file or directory
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.maybeRestoreArchive(CommitLogArchiver.java:172)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:104)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:305)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
Caused by: java.io.IOException: Cannot run program : error=2, No such file or directory
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.cassandra.utils.FBUtilities.exec(FBUtilities.java:588)
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.exec(CommitLogArchiver.java:182)
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.maybeRestoreArchive(CommitLogArchiver.java:168)
    ... 4 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 8 more

No help on the net, and nothing has changed since the last edits to commitlog_archiving.properties. The first time I restarted yesterday there was no problem, and my commit logs were being archived fine. Can someone help me, please? Regards, Bonnet Jonathan.
Re: Try to configure commitlog_archiving.properties
hi Bonnet, that doesn't seem to be a problem with your archiving, but rather with the restoring. What is your restore command? -- artur

On 11/12/13 13:47, Bonnet Jonathan wrote: [...]
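[Editor's note: the empty program name in "Cannot run program :" suggests restore_command (or restore_directories) was set to a blank or whitespace-only value rather than left truly empty. For reference, commitlog_archiving.properties has this general shape; the paths below are illustrative, not Bonnet's actual configuration:]

```properties
# Run for each commit log segment when it is closed.
# %path = fully qualified path of the segment; %name = file name only.
archive_command=/bin/cp %path /backup/commitlog/%name

# Run for each archived segment during restore: %from = archived file,
# %to = destination. Leave truly blank (no stray space) when not restoring.
restore_command=/bin/cp -f %from %to

# Directory scanned for archived segments to replay at startup.
restore_directories=/backup/commitlog

# Replay mutations only up to and including this timestamp.
restore_point_in_time=2013:12:11 23:59:59
```

A stray space after `restore_command=` would make Cassandra try to exec an empty program at startup, which matches the stack trace above.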
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote: Network latency is the reason why the batched query is fastest. One trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

While it is true a batch means only one client-server round trip, I'll note that, provided you use the TokenAware load balancing policy, doing the parallelization client side will save you intra-replica round trips, which using a big batch won't. So it might not be all that clear which one is faster. And very large batches have the disadvantage that you are more likely to get a timeout (and if you do, you have to retry the whole batch, even though most of it has probably been inserted correctly). Overall, the best option probably involves parallelizing the inserts of reasonably sized batches, but what the right sizes are is likely very use-case dependent; you'll have to test. -- Sylvain

[...]
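[Editor's note: the client-side parallelization Sylvain recommends amounts to firing all the inserts without waiting on each response, then collecting the futures. With the DataStax Java driver that call is session.executeAsync(ps.bind(...)); since that needs a live cluster, this sketch uses CompletableFuture as a stand-in for the driver call to show the pattern only:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ParallelInserts {
    // Stand-in for session.executeAsync(ps.bind(...)): returns immediately
    // and completes in the background. With the real driver each future
    // would resolve to a ResultSet instead of a String.
    static CompletableFuture<String> executeAsync(String boundValue) {
        return CompletableFuture.supplyAsync(() -> "applied:" + boundValue);
    }

    public static List<String> insertAll(int n) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            futures.add(executeAsync("" + i)); // fire all requests up front
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            results.add(f.join()); // then wait for them all together
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(insertAll(1000).size());
    }
}
```

Unlike one giant batch, a timeout here costs you a retry of a single small insert, not the whole payload.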
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I didn't do any warming up etc. I am new to Cassandra and was just poking around with some scripts to try to find the fastest way to do things. That said, all the mini-tests ran under the same conditions. In our case the batches will have a variable number of different inserts/updates in them, so doing a whole batch as a PreparedStatement won't help. However, using BatchStatement and stuffing it full of repeated PreparedStatements might be better than a batch with inlined parameters. I will do a test of that and see. I will also let the VM warm up and whatnot this time. On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire. The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. But with only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts.
Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes that's what I found. This is faster: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')") Than this: def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[])) This is the fastest option of all (hand rolled batch): StringBuilder b = new StringBuilder() b.append("BEGIN UNLOGGED BATCH\n") for (int i = 0; i < 1000; i++) { b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','").append("aa").append(i).append("')\n") } b.append("APPLY BATCH\n") session.execute(b.toString()) On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: This loop takes 2500ms or so on my test cluster: PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i)); The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters. Do you mean that: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')"); is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain I know I can use batching to insert all the rows at once but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster but still doesn't match inline values.
Composing CQL strings is certainly convenient and simple but is there a much faster way? Thanks David I have also posted this on Stackoverflow if anyone wants the points: http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application -- http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
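The warmup David mentions matters for any micro-benchmark on a JIT-compiled runtime. A minimal harness with an explicit untimed warmup phase might look like this (an illustrative Python sketch; the stub workload stands in for a real `session.execute(...)` call, which would require a live cluster):

```python
import time

def bench(fn, warmup=10_000, iterations=100_000):
    """Run fn `warmup` times untimed, then time the real iterations."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start

# Stub workload standing in for a driver call.
counter = []
elapsed = bench(lambda: counter.append(1))
print(elapsed > 0)
```

With only 1000 timed iterations and no warmup, as in the thread, one-time costs (statement preparation, JIT compilation, connection setup) dominate and the comparison between prepared and inline statements becomes meaningless.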
Re: Try to configure commitlog_archiving.properties
Artur Kronenberg artur.kronenberg at openmarket.com writes: Hi Bonnet, that doesn't seem to be a problem with your archiving, rather with the restoring. What is your restore command? -- artur On 11/12/13 13:47, Bonnet Jonathan. wrote: Thanks for answering so fast. I put nothing for restore; should I? Because I don't want to restore for the moment. Regards,
Data tombstoned during bulk loading 1.2.10 - 2.0.3
Hi all, We're running into a weird problem trying to migrate our data from a 1.2.10 cluster to a 2.0.3 one. I've taken a snapshot on the old cluster, and for each host there, I'm running sstableloader -d <host of new cluster> KEYSPACE/COLUMNFAMILY (the sstableloader process from the 2.0.3 distribution; the one from 1.2.10 only gets java.lang.RuntimeException: java.io.IOException: Connection reset by peer). It then copies the data successfully, but when checking the data I noticed some rows seemed to be missing. It turned out the data is not missing, but has been tombstoned. When I use sstable2json on the sstable on the destination cluster, it has metadata: {"deletionInfo": {"markedForDeleteAt":1796952039620607,"localDeletionTime":0}}, whereas it doesn't have that in the source sstable. (Yes, this is a timestamp far into the future. All our hosts are properly synced through ntp.) This has happened for a bunch of random rows. How is this possible? Naturally, copying the data again doesn't work to fix it, as the tombstone is far in the future. Apart from not having this happen at all, how can it be fixed? Best regards, Mathijs
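The markedForDeleteAt value is a microsecond-precision timestamp, so it can be decoded to see just how far in the future the tombstone sits (a quick sketch, variable names illustrative):

```python
from datetime import datetime, timezone

marked_for_delete_at = 1796952039620607  # microseconds since the Unix epoch
ts = datetime.fromtimestamp(marked_for_delete_at / 1_000_000, tz=timezone.utc)
print(ts.year)  # 2026 -- thirteen years after this 2013 thread
```

Any write stamped before that point is shadowed by the tombstone, which is why re-copying the data cannot resurrect the rows.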
Re: Try to configure commitlog_archiving.properties
So, looking at the code: public void maybeRestoreArchive() { if (Strings.isNullOrEmpty(restoreDirectories)) return; for (String dir : restoreDirectories.split(",")) { File[] files = new File(dir).listFiles(); if (files == null) { throw new RuntimeException("Unable to list directory " + dir); } for (File fromFile : files) { File toFile = new File(DatabaseDescriptor.getCommitLogLocation(), new CommitLogDescriptor(CommitLogSegment.getNextId()).fileName()); String command = restoreCommand.replace("%from", fromFile.getPath()); command = command.replace("%to", toFile.getPath()); try { exec(command); } catch (IOException e) { throw new RuntimeException(e); } } } } I would like someone to confirm this, but it might potentially be a bug. It does the right thing for an empty restore directory; however, it ignores the fact that the restore command could be empty. So for you, Jonathan, I reckon you have the restore directory set? You don't need that to be set in order to archive (only if you want to restore). So set your restore_directory property to empty and you should get rid of those errors; the directory only needs to be set when you enable the restore command. On a second look, I am almost certain this is a bug, as maybeArchive does correctly check that the command is not empty or null; maybeRestore needs to do the same thing for the restoreCommand. If someone confirms, I am happy to raise a bug. cheers, artur On 11/12/13 14:09, Bonnet Jonathan. wrote: Artur Kronenberg artur.kronenberg at openmarket.com writes: Hi Bonnet, that doesn't seem to be a problem with your archiving, rather with the restoring. What is your restore command? -- artur On 11/12/13 13:47, Bonnet Jonathan. wrote: Thanks for answering so fast. I put nothing for restore; should I? Because I don't want to restore for the moment. Regards,
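The fix Artur suggests amounts to guarding on the restore command as well as the restore directories before attempting a restore. A minimal sketch of that guard logic (Python, with hypothetical names mirroring the quoted Java; the real fix would of course go into the Java method itself):

```python
def should_restore(restore_directories, restore_command):
    """Skip restore when either the directories or the command is unset,
    mirroring the null/empty check that maybeArchive already performs."""
    if not restore_directories or not restore_directories.strip():
        return False
    if not restore_command or not restore_command.strip():
        return False
    return True

print(should_restore("", "cp %from %to"))          # False: nothing to restore from
print(should_restore("/backups", ""))              # False: the case the quoted code misses
print(should_restore("/backups", "cp %from %to"))  # True: both are configured
```

The second case is exactly Jonathan's situation: a restore directory configured but no restore command, which the quoted maybeRestoreArchive does not handle.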
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Very good point. I've written code to do a very large number of inserts, but I've only ever run it on a single-node cluster. I may very well find out when I run it against a multinode cluster that the performance benefits of large unlogged batches mostly go away. From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Wednesday, December 11, 2013 at 6:52 AM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application? On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote: Network latency is the reason why the batched query is fastest. One trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue. While it is true a batch means only one client-server round trip, I'll note that, provided you use the TokenAware load balancing policy, doing the parallelization client side will save you intra-replica round-trips, which using a big batch won't. So it might not be all that clear which one is faster. And very large batches have the disadvantage that you are more likely to get a timeout (and if you do, you have to retry the whole batch, even though most of it has probably been inserted correctly). Overall, the best option probably involves parallelizing the inserts of reasonably sized batches, but what the right sizes are is likely very use-case dependent; you'll have to test. -- Sylvain From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Wednesday, December 11, 2013 at 5:40 AM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application? Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire.
The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. But with only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts. Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes that's what I found.
This is faster: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')") Than this: def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[])) This is the fastest option of all (hand rolled batch): StringBuilder b = new StringBuilder() b.append("BEGIN UNLOGGED BATCH\n") for (int i = 0; i < 1000; i++) { b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','").append("aa").append(i).append("')\n") } b.append("APPLY BATCH\n") session.execute(b.toString()) On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: This loop takes 2500ms or so on my test cluster: PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i)); The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters. Do you mean that: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')"); is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain I know I can use
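The hand-rolled batch from the quoted code can be sketched in Python to show the statement it actually produces (keyspace and table names as in the thread; this only builds the CQL string, it does not talk to a cluster):

```python
def build_unlogged_batch(ks, n):
    """Concatenate n INSERTs into one UNLOGGED BATCH statement,
    like the StringBuilder loop in the quoted Groovy/Java code."""
    parts = ["BEGIN UNLOGGED BATCH"]
    for i in range(n):
        parts.append(
            "INSERT INTO {}.wibble (id, info) VALUES ('{}', 'aa{}')".format(ks, i, i))
    parts.append("APPLY BATCH")
    return "\n".join(parts)

stmt = build_unlogged_batch("test", 3)
print(stmt.splitlines()[0])   # BEGIN UNLOGGED BATCH
print(stmt.splitlines()[-1])  # APPLY BATCH
```

As Sylvain notes later in the thread, the whole statement travels in one request, so a timeout forces a retry of everything in it; reasonably sized batches issued in parallel are usually the safer middle ground.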
Re: Try to configure commitlog_archiving.properties
Thanks Artur, you're right, I must comment out the restore directory too. Now I'll try to practice around restore. Regards, Bonnet Jonathan.
Re: How to create counter column family via Pycassa?
What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Re: How to create counter column family via Pycassa?
What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
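Putting Tyler's points together, a fuller set of keyword arguments might look like this. This is a sketch, not run against a live cluster: the 'users' schema is illustrative, and the commented-out call assumes a connected pycassa SystemManager as described above. It also shows the distinction from the quoted thread between passing a dict as one argument and unpacking it with **:

```python
# Types are given as Cassandra comparator/validator class names.
# Illustrative schema for a hypothetical 'users' column family:
cf_kwargs = {
    "comparator_type": "UTF8Type",           # type of the column *names*
    "key_validation_class": "UTF8Type",      # type of the row keys
    "default_validation_class": "UTF8Type",  # any column not listed below
    "column_validation_classes": {           # per-column value types
        "username": "UTF8Type",
        "age": "IntegerType",
    },
}

# With a live SystemManager, the **-unpacking form (the "#2" that worked
# in the quoted thread) would be:
# sys.create_column_family('mykeyspace', 'users', **cf_kwargs)

print(sorted(cf_kwargs["column_validation_classes"]))  # ['age', 'username']
```

Passing `column_validation_classes=cf_kwargs` (the "#1" form) fails because it hands the whole options dict to a parameter that expects only a column-name-to-type mapping.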
[no subject]
Hey Folks, so I am creating a column family using pycassaShell. See below: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) I am getting this error: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') My data will look like this: 'row_key' : { 'tid' : { 'expanded_url': u'http://instagram.com/p/hwDj2BJeBy/', 'text': '#snowinginNYC Makes me so happy\xe2\x9d\x840brittles0 \xe2\x9b\x84 @ Grumman Studios http://t.co/rlOvaYSfKa', 'profile_image': u' https://pbs.twimg.com/profile_images/3262070059/1e82f895559b904945d28cd3ab3947e5_normal.jpeg ', 'tuid': 339322611, 'approved': 'true', 'favorite_count': 0, 'screen_name': u'LonaVigi', 'created_at': u'Wed Dec 11 01:10:05 + 2013', 'embedly_data': {u'provider_url': u'http://instagram.com/', u'description': u"lonavigi's photo on Instagram", u'title': u'#snwinginNYC Makes me so happy\u2744@0brittles0 \u26c4', u'url': u' http://distilleryimage7.ak.instagram.com/5b880dec61c711e3a50b129314edd3b_8.jpg', u'thumbnail_width': 640, u'height': 640, u'width': 640, u'thumbnail_url': u' http://distilleryimage7.ak.instagram.com/b880dec61c711e3a50b1293d14edd3b_8.jpg', u'author_name': u'lonavigi', u'version': u'1.0', u'provider_name': u'Instagram', u'type': u'poto', u'thumbnail_height': 640, u'author_url': u' http://instagram.com/lonavigi'}, 'tid': 410577192746500096, 'retweet_count': 0 } }
Re: Cyclop - CQL3 web based editor
Hi Maciej, This looks great! Thanks for building this. On Wed, Dec 11, 2013 at 12:45 AM, Murali muralidharan@gmail.com wrote: Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send an email to this group. It is a web based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web - without the need to install anything. The editor itself supports code completion, based not only on CQL syntax but also on database content - so for example the select statement will suggest tables from the active keyspace, and the where clause only columns from the table provided after select from. The results are displayed in a reversed table - rows horizontally and columns vertically. It seems more natural for a column-oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on wicket + bootstrap + spring and can be deployed in any web 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej -- Thanks, Murali 99025-5 -- Best, Parth
Re: How to create counter column family via Pycassa?
validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) throws: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') Can you please explain why? On Wed, Dec 11, 2013 at 12:08 PM, Tyler Hobbs ty...@datastax.com wrote: What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply.
Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Re: How to create counter column family via Pycassa?
This works when I remove comparator_type: validators = { 'tid': 'IntegerType', 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'BytesType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) On Wed, Dec 11, 2013 at 12:23 PM, Kumar Ranjan winnerd...@gmail.com wrote: I am using ccm cassandra version *1.2.11* On Wed, Dec 11, 2013 at 12:19 PM, Kumar Ranjan winnerd...@gmail.com wrote: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) throws: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') Can you please explain why? On Wed, Dec 11, 2013 at 12:08 PM, Tyler Hobbs ty...@datastax.com wrote: What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs?
SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Bulkoutputformat
Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in hadoop. Is it the best way, or is using the driver and doing batch inserts the better way? Are there any disadvantages to using BulkOutputFormat? Thanks for helping Varun
efficient way to store 8-bit or 16-bit value?
What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
Re: efficient way to store 8-bit or 16-bit value?
Column metadata is about 20 bytes, so there is no big difference whether you save 1 or 4 bytes. Thank you, Andrey On Wed, Dec 11, 2013 at 2:42 PM, onlinespending onlinespend...@gmail.com wrote: What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
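For the original question, a one-byte value can indeed be sent as a one-byte blob; a sketch of the client-side packing using Python's struct module shows the size difference Andrey is dismissing (the ~20-byte per-column overhead figure comes from the reply above and is era-specific):

```python
import struct

value = 200                          # fits in 8 bits
one_byte = struct.pack(">B", value)  # a 1-byte blob
as_int = struct.pack(">i", value)    # what a 32-bit int column would carry
print(len(one_byte), len(as_int))    # 1 4
# Saving 3 bytes per column is dwarfed by the per-column metadata
# overhead (~20 bytes in this era of Cassandra), hence the advice.
```

In other words, the choice between a 1-byte blob and a 32-bit int is mostly a matter of convenience, not storage efficiency.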
Re: nodetool repair keeping an empty cluster busy
On Wed, Dec 11, 2013 at 1:35 AM, Sven Stark sven.st...@m-square.com.au wrote: thanks for replying. Could you please be a bit more specific, though? E.g. what exactly is being compacted - there is/was no data at all in the cluster save for a few hundred kB in the system CF (see the nodetool status output). Or: how can those few hundred kB of data generate GBs of network traffic? The only answer I can come up with is that the Merkle trees generated and compared by repair are of a fixed size, and don't scale with the data present in the cluster. While I'm pretty sure each node can be aware that it has little to no data to repair, it generates and compares the trees anyway. It's a bit surprising that this might be GBs of network traffic... The system keyspace will always have some data in it; have you tried only compacting your empty keyspace instead of the whole node? If so, and it exhibits the same behavior, that seems like a bug or at least unexpected behavior to me. If you're running a modern version of Cassandra, I would file a JIRA. =Rob
Re: AddContactPoint / VIP
What is the good practice to put in the code as addContactPoint, i.e., how many servers? I use the same nodes as the seed list nodes for that DC. The idea of the seed list is that it's a list of well known nodes, and it's easier operationally to say we have one list of well known nodes that is used by the servers and the clients. 1) I am also thinking to put it this way (here I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. Too complicated. 2) Is it a problem if I use multiple data centers in future? You only need to give the client the local seeds; it will discover all the nodes. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 7:12 am, chandra Varahala hadoopandcassan...@gmail.com wrote: Greetings, I have a 4 node cassandra cluster that will grow up to 10 nodes; we are using the CQL Java client to access the data. What is the good practice to put in the code as addContactPoint, i.e., how many servers? 1) I am also thinking to put it this way (here I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. 2) Is it a problem if I use multiple data centers in future? thanks Chandra
Re: Write performance with 1.2.12
Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency? If it's write latency it's probably GC; if it's read, it's probably IO or the data model. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It is defaulting to 1/3 of the heap, which is 8/3 ~ 2.6 GB in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 GB to the disk might slow the performance if frequently called; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations, which are in cassandra.yaml. Apologies, I was tuning the JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 GB heap. Are these spikes causing any issue at your end? There are no big spikes; the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing the cassandra configuration.
The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU, I had similar behavior with my setup. Were you able to get around it? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the ssd (for commit log) and sdb is the spinner.
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00   0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.24   9.80   0.13   0.32
dm-4       0.00    0.00  0.00   6.60    0.00   52.80      8.00      0.01   1.36   0.55   0.36
dm-5       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-6       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.29  11.60   0.13   0.32

I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ?
Re: OOMs during high (read?) load in Cassandra 1.2.11
Do you have the back trace from the heap dump so we can see what the array was and what was using it? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 4:41 am, Klaus Brunner klaus.brun...@gmail.com wrote: 2013/12/9 Nate McCall n...@thelastpickle.com: Do you have any secondary indexes defined in the schema? That could lead to a 'mega row' pretty easily depending on the cardinality of the value. That's an interesting point - but no, we don't have any secondary indexes anywhere. From the heap dump, it's fairly evident that it's not a single huge row but actually many rows. I'll keep watching if this occurs again, or if the compaction fixed it for good. Thanks, Klaus
Re: Data Modelling Information
create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) This statement is syntactically invalid, also you cannot use a collection type in the primary key. 1) I should be able to query by username and get all the messages for a particular username yes. 2) I should be able to query by tags and username (like select * from messages where username='xya' and tags in ('awesome','phone')) No. 3) I should be able to query all messages by day and order by desc and limit to some value No. Could you guys please let me know if creating a secondary index on the tags field would work? No, it’s not supported. Or what would be the best way to model this data. You need to describe the problem and how you want to read the data. I suggest taking a look at the data modelling videos from Patrick here http://planetcassandra.org/Learn/CassandraCommunityWebinars Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 8:57 am, Shrikar archak shrika...@gmail.com wrote: Hi Data Model Experts, I have a few questions with data modelling for a particular application. example create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) Requirements 1) I should be able to query by username and get all the messages for a particular username 2) I should be able to query by tags and username (like select * from messages where username='xya' and tags in ('awesome','phone')) 3) I should be able to query all messages by day and order by desc and limit to some value Could you guys please let me know if creating a secondary index on the tags field would work? Or what would be the best way to model this data. Thanks, Shrikar
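Following the advice above to model around the read paths, one common shape is a table per query with the application denormalizing on write. The schema below is only a sketch of that idea — the table and column names are made up, not from the thread, and the tag query is served by writing one row per tag:

```sql
-- Sketch only: hypothetical names, one table per read path.

-- Query 1: all messages for a username, newest first.
CREATE TABLE messages_by_user (
    username text,
    created timeuuid,
    body text,
    PRIMARY KEY (username, created)
) WITH CLUSTERING ORDER BY (created DESC);

-- Query 2: messages for a username filtered by tag.
-- The application inserts one copy of the message per tag.
CREATE TABLE messages_by_user_tag (
    username text,
    tag text,
    created timeuuid,
    body text,
    PRIMARY KEY ((username, tag), created)
) WITH CLUSTERING ORDER BY (created DESC);

-- Query 3: messages by day, newest first (day string as the partition key).
CREATE TABLE messages_by_day (
    day text,          -- e.g. '2013-12-11'
    created timeuuid,
    body text,
    PRIMARY KEY (day, created)
) WITH CLUSTERING ORDER BY (created DESC);
```

With this shape, a multi-tag lookup such as `tags in ('awesome','phone')` becomes one SELECT per tag against `messages_by_user_tag`, merged client-side.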
Re: Write performance with 1.2.12
Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.comwrote: Changed memtable_total_space_in_mb to 1024 still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. You are right, had to revert it back to default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency ? If it’s write latency it’s probably GC, if it’s read is probably IO or data model. It is the write latency, read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases, if I stop traffic to other nodes the latency drops again, I checked, this is not node specific it happens to any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024 still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value, it is defaulting to 1/3 which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called, may be you have lots of write operations going on. 
On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. 
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00   0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.24   9.80   0.13   0.32
dm-4       0.00    0.00  0.00   6.60    0.00   52.80
Re: Nodetool repair exceptions in Cassandra 2.0.2
[2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 The 10.x.x.48 node sent a tree response (merkle tree) to this node that did not contain the tree. This node then killed the repair session. Look for log messages on 10.x.x.48 that correlate with the repair session ID above. They may look like logger.error("Failed creating a merkle tree for " + desc + ", " + initiator + " (see log for details)"); or logger.info(String.format("[repair #%s] Sending completed merkle tree to %s for %s/%s", desc.sessionId, initiator, desc.keyspace, desc.columnFamily)); Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 12:57 pm, Laing, Michael michael.la...@nytimes.com wrote: My experience is that you must upgrade to 2.0.3 ASAP to fix this. Michael On Mon, Dec 9, 2013 at 6:39 PM, David Laube d...@stormpath.com wrote: Hi All, We are running Cassandra 2.0.2 and have recently stumbled upon an issue with nodetool repair.
Upon running nodetool repair on each of the 5 nodes in the ring (one at a time) we observe the following exceptions returned to standard out; [2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 [2013-12-08 11:04:02,063] Repair session 284c8b40-5ff8-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #284c8b40-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,070] Repair session 399e7160-5ff8-11e3-97c0-5973cc397f8f for range (8901153810410866970,8915879751739915956] failed with error org.apache.cassandra.exceptions.RepairException: [repair #399e7160-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (8901153810410866970,8915879751739915956]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,072] Repair session 3ea73340-5ff8-11e3-97c0-5973cc397f8f for range (1149084504576970235,1190026362216198862] failed with error org.apache.cassandra.exceptions.RepairException: [repair #3ea73340-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1149084504576970235,1190026362216198862]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,091] Repair session 6f0da460-5ff8-11e3-97c0-5973cc397f8f for range (-5407189524618266750,-5389231566389960750] failed with error org.apache.cassandra.exceptions.RepairException: [repair #6f0da460-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (-5407189524618266750,-5389231566389960750]] Validation failed in /10.x.x.103 [2013-12-09 23:16:36,962] Repair session 7efc2740-6127-11e3-97c0-5973cc397f8f for range 
(1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #7efc2740-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 [2013-12-09 23:16:36,986] Repair session a8c44260-6127-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #a8c44260-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.210 The /var/log/cassandra/system.log shows similar info as above with no real explanation as to the root cause behind the exception(s). There also does not appear to be any additional info in /var/log/cassandra/cassandra.log. We have tried restoring a recent snapshot of the keyespace in question to a separate staging ring and the repair runs successfully and without exception there. This is even after we tried insert/delete on the keyspace in the separate staging ring. Has anyone seen this behavior before and what can we do to resolve this? Any assistance would be greatly appreciated. Best regards, -Dave
Re: setting PIG_INPUT_INITIAL_ADDRESS environment . variable in Oozie for cassandra ...¿?
Caused by: java.io.IOException: PIG_INPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setLocation(CassandraStorage.java:314) at org.apache.cassandra.hadoop.pig.CassandraStorage.getSchema(CassandraStorage.java:358) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151) ... 35 more Have you checked these are set? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 4:00 am, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: Hi, I have an error with the pig action in oozie 4.0.0 using CassandraStorage. (cassandra 1.2.10) I can run pig scripts fine with cassandra, but when I try to use CassandraStorage to load data I get this error: Run pig script using PigRunner.run() for Pig version 0.8+ Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 Run pig script using PigRunner.run() for Pig version 0.8+ 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,501 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:155) at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:110) at org.apache.pig.newplan.logical.relational.LOStore.getSchema(LOStore.java:68) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.validate(SchemaAliasVisitor.java:60) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.visit(SchemaAliasVisitor.java:84) at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1617) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1611) at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1334) at 
org.apache.pig.PigServer.execute(PigServer.java:1239) at org.apache.pig.PigServer.executeBatch(PigServer.java:362) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:430) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at
Re: Exactly one wide row per node for a given CF?
Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. As the number of columns in a row increases so does the size of the column index which is read as part of the read path. For background and comparisons of latency see http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html or my talk on performance at the SF summit last year http://thelastpickle.com/speaking/2012/08/08/Cassandra-Summit-SF.html While the column index has been lifted to the -Index.db component AFAIK it must still be fully loaded. Larger rows take longer to go through compaction, tend to cause more JVM GC and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync we will create a new copy of that row on the node, which must then be compacted. I’ve seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings. IMHO all things being equal rows in the few 10’s of MB work better. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 2:41 am, Robert Wille rwi...@fold3.com wrote: I have a question about this statement: When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. I tested a data model that I created. Here’s the schema for the table in question: CREATE TABLE bdn_index_pub ( tree INT, pord INT, hpath VARCHAR, PRIMARY KEY (tree, pord) ); As a test, I inserted 100 million records. tree had the same value for every record, and I had 100 million values for pord. hpath averaged about 50 characters in length.
My understanding is that all 100 million strings would have been stored in a single row, since they all had the same value for the first component of the primary key. I didn’t look at the size of the table, but it had to be several gigs (uncompressed). Contrary to what Aaron says, I do want to know what happens, because I didn’t experience any issues with this table during my test. Inserting was fast. The last batch of records inserted in approximately the same amount of time as the first batch. Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. If this is bad, can somebody suggest a better pattern? This table was designed to support a query like this: select hpath from bdn_index_pub where tree = :tree and pord >= :start and pord <= :end. In my application, most trees will have less than a million records. A handful will have 10’s of millions, and one of them will have 100 million. If I need to break up my rows, my first instinct would be to divide each tree into blocks of say 10,000 and change tree to a string that contains the tree and the block number. Something like this: 17:0, 0, ‘/’ … 17:0, , ’/a/b/c’ 17:1,1, ‘/a/b/d’ … I’d then need to issue an extra query for ranges that crossed block boundaries. Any suggestions on a better pattern? Thanks Robert From: Aaron Morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 10, 2013 at 12:33 AM To: Cassandra User user@cassandra.apache.org Subject: Re: Exactly one wide row per node for a given CF? But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). This is exactly the problem consistent hashing (used by cassandra) is designed to solve. If you hash the key and modulo the number of nodes, adding and removing nodes requires a lot of data to move.
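Robert's block idea can be sketched in a few lines. This is a hedged illustration only — the bucket size and the `tree:block` key format are assumptions drawn from his "blocks of say 10,000" suggestion, not a tested design:

```python
BLOCK = 10000  # assumed bucket size, per the "blocks of say 10,000" idea

def partition_key(tree, pord):
    """Composite partition key '<tree>:<block>': caps each row at BLOCK entries."""
    return "%d:%d" % (tree, pord // BLOCK)

def blocks_for_range(tree, start, end):
    """Partition keys a range query [start, end] must touch (one query per key)."""
    return ["%d:%d" % (tree, b) for b in range(start // BLOCK, end // BLOCK + 1)]
```

A range that crosses a block boundary simply becomes one query per returned key, which is the "extra query" Robert anticipates.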
I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. Sounds like you should revisit your data modelling, this is a pretty well known anti pattern. When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. It’s a bad idea and you should take another look at the data model. If you have to do it, you can try the ByteOrderedPartitioner which uses the row key as a token, giving you total control of the row placement. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 8:32 pm, Vivek Mishra
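The point about `id % N` versus consistent hashing can be checked numerically. The toy model below is not Cassandra's actual partitioner code (Cassandra uses RandomPartitioner/Murmur3 with per-node tokens); it just compares how many keys change owner when a 4-node cluster grows to 5 under each scheme:

```python
import bisect
import hashlib

def h(s):
    # Deterministic hash; Python's builtin hash() is randomized per process.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def owner_mod(key, n):
    # "id % N" placement: most keys change owner whenever N changes.
    return h(key) % n

def build_ring(num_nodes, vnodes=64):
    # Sorted (token, node) pairs: each node claims several points on a 2**32 ring.
    return sorted((h("%d-%d" % (n, v)) % 2**32, n)
                  for n in range(num_nodes) for v in range(vnodes))

def owner_ring(key, ring):
    # Walk clockwise from the key's point to the next token; wrap at the end.
    points = [t for t, _ in ring]
    i = bisect.bisect(points, h(key) % 2**32) % len(ring)
    return ring[i][1]

keys = ["key%d" % i for i in range(10000)]
moved_mod = sum(owner_mod(k, 4) != owner_mod(k, 5) for k in keys) / len(keys)
r4, r5 = build_ring(4), build_ring(5)
moved_ring = sum(owner_ring(k, r4) != owner_ring(k, r5) for k in keys) / len(keys)
# Roughly 80% of keys move under modulo, but only about 1/5 on the ring.
```

This is exactly why Aaron calls modulo placement troublesome: growing the cluster by one node relocates the large majority of the data, while on a token ring only the share claimed by the new node moves.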
user / password authentication advice
Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using the secure token (e.g. a cookie). Obviously I’m only scratching the surface and it’s the detail and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
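Not Cassandra-specific, but the usual baseline for the "verified against a table" step is a salted, deliberately slow hash rather than encryption. A minimal sketch with Python's standard-library PBKDF2 — the iteration count and salt size here are illustrative choices, not a recommendation tuned for any particular deployment:

```python
import hashlib
import hmac
import os

ITERATIONS = 100000  # illustrative; tune to your hardware

def hash_password(password, salt=None):
    """Return (salt, digest); store both in the users table, never the plaintext."""
    salt = salt or os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)
```

On a successful verify, the application would then mint a random session token (e.g. `os.urandom`), store it server-side with an expiry, and hand it to the browser as the cookie Ben describes.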
Re: Data tombstoned during bulk loading 1.2.10 - 2.0.3
On Wed, Dec 11, 2013 at 6:27 AM, Mathijs Vogelzang math...@apptornado.comwrote: When I use sstable2json on the sstable on the destination cluster, it has metadata: {deletionInfo: {markedForDeleteAt:1796952039620607,localDeletionTime:0}}, whereas it doesn't have that in the source sstable. (Yes, this is a timestamp far into the future. All our hosts are properly synced through ntp). This seems like a bug in sstableloader, I would report it on JIRA. Naturally, copying the data again doesn't work to fix it, as the tombstone is far in the future. Apart from not having this happen at all, how can it be fixed? Briefly, you'll want to purge that tombstone and then reload the data with a reasonable timestamp. Dealing with rows with data (and tombstones) in the far future is described in detail here : http://thelastpickle.com/blog/2011/12/15/Anatomy-of-a-Cassandra-Partition.html =Rob
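The "copying the data again doesn't work" part follows from Cassandra's reconciliation rule: the highest write timestamp wins, and a cell is shadowed by a row tombstone unless it was written after markedForDeleteAt. A toy comparison makes it concrete (the reload timestamp is an illustrative present-day value, not from the thread):

```python
def is_visible(cell_ts, marked_for_delete_at):
    # Cassandra keeps a cell under a row tombstone only if the cell's
    # write timestamp is strictly newer than the deletion timestamp.
    return cell_ts > marked_for_delete_at

tombstone_ts = 1796952039620607  # microseconds, ~2026 -- the far-future value from the thread
reload_ts = 1386720000000000     # illustrative: ~11 Dec 2013 in microseconds

# Re-streamed data carries a present-day timestamp, so it stays shadowed:
assert not is_visible(reload_ts, tombstone_ts)
```

Hence Rob's advice: the tombstone itself has to be purged (or the data rewritten with a timestamp beyond it, which only compounds the problem) before a normal reload can become visible again.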
Re:
SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) CompositeType is a type composed of other types, see http://pycassa.github.io/pycassa/assorted/composite_types.html?highlight=compositetype Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:15 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, So I am creating, column family using pycassaShell. See below: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) I am getting this error: InvalidRequestException: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.' 
My data will look like this: 'row_key' : { 'tid' : { 'expanded_url': u'http://instagram.com/p/hwDj2BJeBy/', 'text': '#snowinginNYC Makes me so happy\xe2\x9d\x840brittles0 \xe2\x9b\x84 @ Grumman Studios http://t.co/rlOvaYSfKa', 'profile_image': u'https://pbs.twimg.com/profile_images/3262070059/1e82f895559b904945d28cd3ab3947e5_normal.jpeg', 'tuid': 339322611, 'approved': 'true', 'favorite_count': 0, 'screen_name': u'LonaVigi', 'created_at': u'Wed Dec 11 01:10:05 + 2013', 'embedly_data': {u'provider_url': u'http://instagram.com/', u'description': ulonavigi's photo on Instagram, u'title': u'#snwinginNYC Makes me so happy\u2744@0brittles0 \u26c4', u'url': u'http://distilleryimage7.ak.instagram.com/5b880dec61c711e3a50b129314edd3b_8.jpg', u'thumbnail_width': 640, u'height': 640, u'width': 640, u'thumbnail_url': u'http://distilleryimage7.ak.instagram.com/b880dec61c711e3a50b1293d14edd3b_8.jpg', u'author_name': u'lonavigi', u'version': u'1.0', u'provider_name': u'Instagram', u'type': u'poto', u'thumbnail_height': 640, u'author_url': u'http://instagram.com/lonavigi'}, 'tid': 410577192746500096, 'retweet_count': 0 } }
Re: Cyclop - CQL3 web based editor
thanks, looks handy. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:16 am, Parth Patil parthpa...@gmail.com wrote: Hi Maciej, This looks great! Thanks for building this. On Wed, Dec 11, 2013 at 12:45 AM, Murali muralidharan@gmail.com wrote: Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send email to this group. This is a web based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web - without the need to install anything. The editor itself supports code completion, based not only on CQL syntax but also on database content - so for example the select statement will suggest tables from the active keyspace, and the where clause suggests only columns from the table named in the from clause. The results are displayed in a reversed table - rows horizontally and columns vertically. It seems to be more natural for a column oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on wicket + bootstrap + spring and can be deployed in any web 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej -- Thanks, Murali 99025-5 -- Best, Parth
Re: CLUSTERING ORDER CQL3
You need to specify all the clustering key components in the CLUSTERING ORDER BY clause create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (cid ASC, ts DESC); cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 10:44 am, Shrikar archak shrika...@gmail.com wrote: Hi All, My Usecase: I want query results ordered by timestamp DESC. But I don't want timestamp to be the second column in the primary key as that will take away my querying capability, for example create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts DESC); Queries required: I want the result for all the below queries to be in DESC order of timestamp select * from demo where oid = 100; select * from demo where oid = 100 and cid = 10; select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); I am trying to create this table with CLUSTERING ORDER in CQL and getting this error cqlsh:viralheat> create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts desc); Bad Request: Missing CLUSTERING ORDER for column cid In this document it mentions that we can have multiple keys for cluster ordering. Anyone know how to do that? Go here Datastax doc If I make the timestamp the second column then I can't have queries like select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); Thanks, Shrikar
Re: Bulkoutputformat
If you don’t need to use Hadoop then try the SSTableSimpleWriter and sstableloader; this post is a little old but still relevant http://www.datastax.com/dev/blog/bulk-loading Otherwise AFAIK BulkOutputFormat is what you want from hadoop http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com wrote: Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in hadoop. Is it the best way, or is using the driver and doing batch inserts better? Are there any disadvantages of using BulkOutputFormat? Thanks for helping Varun
Re: efficient way to store 8-bit or 16-bit value?
What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. blob is a byte array or you could use the varint, a variable length integer, but you probably want the blob. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 1:33 pm, Andrey Ilinykh ailin...@gmail.com wrote: Column metadata is about 20 bytes. So, there is no big difference if you save 1 or 4 bytes. Thank you, Andrey On Wed, Dec 11, 2013 at 2:42 PM, onlinespending onlinespend...@gmail.com wrote: What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
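To make the blob suggestion concrete: on the client side an 8- or 16-bit value really does serialize to one or two bytes before it reaches Cassandra. A small sketch with Python's struct module (the per-cell overhead Andrey mentions is added by the storage engine on top of this payload):

```python
import struct

# Pack small values into the smallest possible blob payload.
one_byte = struct.pack("B", 0x2A)     # unsigned 8-bit value
two_bytes = struct.pack(">H", 40000)  # unsigned 16-bit value, big-endian

# The blob payloads are exactly 1 and 2 bytes, and round-trip cleanly.
```

Since the cell metadata dwarfs the payload either way, the choice between a 1-byte blob and a 4-byte int is mostly about schema clarity rather than space.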
Re: Write performance with 1.2.12
It is the write latency; read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node-specific; it happens on any node. Is this the local write latency or the cluster-wide write request latency? What sort of numbers are you seeing? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 3:39 pm, srmore comom...@gmail.com wrote: Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more work for compaction and result in increased IO. You should return it to the default. You are right, had to revert it back to default. When I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency? If it’s write latency it’s probably GC; if it’s read it’s probably IO or data model. It is the write latency; read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node-specific; it happens on any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps.
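Aaron's distinction between local write latency and cluster-wide request latency can be checked directly with nodetool on a 1.2.x node; a quick sketch (the host, keyspace, and column family names here are examples):

```shell
# Coordinator-level (cluster-wide) read/write request latency histograms:
nodetool -h node1 proxyhistograms

# Local read/write latency for a specific column family on this node:
nodetool -h node1 cfhistograms mykeyspace mycolumnfamily
```

If proxyhistograms rises while cfhistograms stays flat, the extra time is spent in inter-node coordination (replication to other replicas at LOCAL_QUORUM) rather than in the local write path — consistent with latency increasing only when traffic goes to all nodes.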
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It defaults to 1/3 of the heap, which is 8/3 ≈ 2.6 GB here: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management Flushing 2.6 GB to disk might slow performance if it happens frequently; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configuration and not the Cassandra configuration, which is in cassandra.yaml. Apologies, was tuning the JVM and that's what was in my mind. Here are the Cassandra settings: http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 GB heap. Are these spikes causing any issue at your end? There are no big spikes; the overall performance seems to be about 40% lower. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing the Cassandra configuration. The Cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU; I had similar behavior with my setup. Were you able to get around it? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running Cassandra 1.2.12. They are pretty big machines, 64 GB RAM with 16 cores; the Cassandra heap is 8 GB.
The interesting observation is that when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip, but not half as seen with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement, but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the SSD (for the commit log) and sdb is the spinner.

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00  4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00  0.00    0.00    0.00
Re: user / password authentication advice
Not sure if you are asking about authentication/authorisation in Cassandra, or how to implement the same using Cassandra. Info on Cassandra authentication and authorisation is here: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/security/securityTOC.html Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 4:31 pm, onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows:
- the user types in a username (e.g. email address) and password
- this is verified against a table storing usernames and passwords (encrypted in some way)
- a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie)
Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
Re: user / password authentication advice
Hi! You're right, this isn't really Cassandra-specific. Most languages/web frameworks have their own way of doing user authentication, and then you typically write a plugin that stores whatever data the system needs in Cassandra. For example, if you're using Java (or Scala or Groovy or anything else JVM-based), Apache Shiro is a good way of doing user authentication and authorization: http://shiro.apache.org/. Just implement a custom Realm for Cassandra and you should be set. /Janne On Dec 12, 2013, at 05:31 , onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie). Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
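Whichever framework handles the sessions, the storage side of the steps Ben lists usually reduces to: hash the password with a salt on registration, recompute and compare on login, then hand back an opaque random token. A minimal sketch of that logic in Python, standard library only — the function names are made up and the Cassandra reads/writes are deliberately left out:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted PBKDF2 hash; store (salt, digest) alongside the username."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

def issue_token():
    """Opaque random session token to hand back to the browser (e.g. in a cookie)."""
    return os.urandom(24).hex()

# Simulated flow: register, then attempt two logins.
salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))  # correct password
print(verify_password("wrong", salt, stored))   # wrong password
token = issue_token()
```

The point of PBKDF2 (or bcrypt/scrypt) over a plain hash is that the iteration count makes brute-forcing a leaked table expensive; Shiro's credential matchers implement the same idea on the JVM side.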
Re: user / password authentication advice
OK, thanks for getting me going in the right direction. I imagine most people would store password and tokenized authentication information in a single table, using the username (e.g. email address) as the key? On Dec 11, 2013, at 10:44 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! You're right, this isn't really Cassandra-specific. Most languages/web frameworks have their own way of doing user authentication, and then you typically write a plugin that stores whatever data the system needs in Cassandra. For example, if you're using Java (or Scala or Groovy or anything else JVM-based), Apache Shiro is a good way of doing user authentication and authorization: http://shiro.apache.org/. Just implement a custom Realm for Cassandra and you should be set. /Janne On Dec 12, 2013, at 05:31 , onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie). Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
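Following Ben's single-table idea, one possible CQL3 layout keyed by email could look like the sketch below. All names and types here are illustrative, and whether tokens live in the same row or in a separate table (so they can expire independently via TTL) is a design choice the thread leaves open:

```
CREATE TABLE users (
    email text PRIMARY KEY,   -- username doubles as the partition key
    salt blob,
    password_hash blob,       -- e.g. PBKDF2/bcrypt output, never plaintext
    auth_token text
);
```

A separate auth_tokens table keyed by the token itself, written USING TTL, would let session lookups go straight from cookie to user without touching the password row.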