Re: 2 nodes cassandra cluster raid10 or JBOD
Hi, What about using JBOD and replication factor 2? Regards. On 11 Dec 2013 02:03, cem cayiro...@gmail.com wrote: Hi all, I need to set up a 2-node Cassandra cluster. I know that Datastax recommends using JBOD as the disk configuration and relying on replication for redundancy. I was planning to use RAID 10, but JBOD would save 50% of the disk space and increase performance. However, I am not sure I should use JBOD with a 2-node cluster, since there is a higher chance of losing 50% of the cluster compared to a larger cluster. I may prefer to have stronger nodes if I have a limited number of nodes. What do you think about that? Is there anyone who runs a 2-node cluster? Best Regards, Cem
Cyclop - CQL3 web based editor
Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send an email to this group. It is a web-based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web, without needing to install anything. The editor supports code completion based not only on CQL syntax but also on database content - so, for example, a SELECT statement will suggest tables from the active keyspace, and the WHERE clause only columns from the table given after SELECT ... FROM. The results are displayed in a transposed table - rows horizontally and columns vertically - which seems more natural for a column-oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on Wicket + Bootstrap + Spring and can be deployed in any Servlet 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej
Re: Cyclop - CQL3 web based editor
Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: [...] -- Thanks, Murali
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
This loop takes 2500ms or so on my test cluster:

PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)");
for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));

The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters.

Do you mean that:

for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");

is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain

I know I can use batching to insert all the rows at once but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster but still doesn't match inline values. Composing CQL strings is certainly convenient and simple but is there a much faster way? Thanks, David

I have also posted this on Stackoverflow if anyone wants the points: http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
Re: nodetool repair keeping an empty cluster busy
Sven, So basically when you run a repair you are essentially telling your cluster to run a validation compaction, which generates a Merkle tree on all the nodes. These trees are used to identify the inconsistencies, so there is quite a bit of streaming, which you see as your network traffic. Rahul

On Wed, Dec 11, 2013 at 11:02 AM, Sven Stark sven.st...@m-square.com.au wrote: Corollary: what is getting shipped over the wire? The ganglia screenshot shows the network traffic on all three hosts on which I ran the nodetool repair. Remember:

UN 10.1.2.11 107.47 KB 256 32.9% 1f800723-10e4-4dcd-841f-73709a81d432 rack1
UN 10.1.2.10 127.67 KB 256 32.4% bd6b2059-e9dc-4b01-95ab-d7c4fc0ec639 rack1
UN 10.1.2.12 107.62 KB 256 34.7% 5258f178-b20e-408f-a7bf-b6da2903e026 rack1

Much appreciated. Sven

On Wed, Dec 11, 2013 at 3:56 PM, Sven Stark sven.st...@m-square.com.au wrote: Howdy! Not a matter of life or death, just curious. I've just stood up a three-node cluster (v1.2.8) on three c3.2xlarge boxes in AWS. Silly me forgot the correct replication factor for one of the needed keyspaces, so I changed it via cli and ran a nodetool repair. Well... there is no data at all in the keyspace yet, only the definition, and nodetool repair ran about 20 minutes using 2 of the 8 CPUs fully. Any hints what nodetool repair is doing on an empty cluster that makes the host spin so hard?
Cheers, Sven

==
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 22.7%us, 1.0%sy, 2.9%ni, 73.0%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 15339196k total, 7474360k used, 7864836k free, 251904k buffers
Swap: 0k total, 0k used, 0k free, 798324k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
10840 cassandr 20  0 8354m 4.1g  19m S  218 28.0 35:25.73 jsvc
16675 kafka    20  0 3987m 192m  12m S    2  1.3  0:47.89 java
20328 root     20  0 5613m 569m  16m S    2  3.8  1:35.13 jsvc
 5969 exhibito 20  0 6423m 116m  12m S    1  0.8  0:25.87 java
14436 tomcat7  20  0 3701m 167m  11m S    1  1.1  0:25.80 java
 6278 exhibito 20  0 6487m 119m 9984 S    0  0.8  0:22.63 java
17713 storm    20  0 6033m 159m  11m S    0  1.1  0:10.99 java
18769 storm    20  0 5773m 156m  11m S    0  1.0  0:10.71 java

root@xxx-01:~# nodetool -h `hostname` status
Datacenter: datacenter1
===
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address   Load      Tokens Owns  Host ID                              Rack
UN 10.1.2.11 107.47 KB 256    32.9% 1f800723-10e4-4dcd-841f-73709a81d432 rack1
UN 10.1.2.10 127.67 KB 256    32.4% bd6b2059-e9dc-4b01-95ab-d7c4fc0ec639 rack1
UN 10.1.2.12 107.62 KB 256    34.7% 5258f178-b20e-408f-a7bf-b6da2903e026 rack1

root@xxx-01:~# nodetool -h `hostname` compactionstats
pending tasks: 1
compaction type  keyspace  column family  completed  total  unit  progress
Active compaction remaining time: n/a

root@xxx-01:~# nodetool -h `hostname` netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name  Active  Pending  Completed
Commands   n/a     0        57155
Responses  n/a     0        14573
Re: nodetool repair keeping an empty cluster busy
Hi Rahul, thanks for replying. Could you please be a bit more specific, though? E.g. what exactly is being compacted - there is/was no data at all in the cluster save for a few hundred kB in the system CF (see the nodetool status output). Or: how can those few hundred kB of data generate GB of network traffic? Cheers, Sven

On Wed, Dec 11, 2013 at 7:56 PM, Rahul Menon ra...@apigee.com wrote: [...]
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire. The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. With only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts. Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain

On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes, that's what I found.
This is faster:

for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")

Than this:

def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))

This is the fastest option of all (hand-rolled batch):

StringBuilder b = new StringBuilder()
b.append("BEGIN UNLOGGED BATCH\n")
for (int i = 0; i < 1000; i++) {
    b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','")
        .append("aa").append(i).append("')\n")
}
b.append("APPLY BATCH\n")
session.execute(b.toString())

On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: [...] -- http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I use hand-rolled batches a lot. You can get a *lot* of performance improvement. Just make sure to sanitize your strings. I've been wondering: what's the limit, practical or hard, on the length of a query? Robert

On 12/11/13, 3:37 AM, David Tinker david.tin...@gmail.com wrote: [...]
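[Editor's note: Robert's "sanitize your strings" point matters because the inlined-values approach above concatenates user data straight into CQL. In CQL string literals a single quote is escaped by doubling it. A minimal sketch; the `escape` helper name is hypothetical, and this is illustration, not a complete injection defense:]

```java
public class CqlEscape {
    // CQL escapes a single quote inside a string literal by doubling it:
    // 'O''Brien' is the literal for O'Brien.
    public static String escape(String s) {
        return s.replace("'", "''");
    }

    public static void main(String[] args) {
        String name = "O'Brien";
        // Building an inline INSERT with the value escaped first:
        String cql = "INSERT INTO test.wibble (id, info) VALUES ('1', '" + escape(name) + "')";
        System.out.println(cql);
    }
}
```

Prepared statements avoid this problem entirely, since bound values are never spliced into the query text.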
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Network latency is the reason why the batched query is fastest: one trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

From: Sylvain Lebresne sylv...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, December 11, 2013 at 5:40 AM
To: user@cassandra.apache.org
Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application?

[...]
Re: Try to configure commitlog_archiving.properties
Bonnet Jonathan jonathan.bonnet at externe.bnpparibas.com writes: Thanks a lot, it works, I see commit logs being archived. I'll try the restore command tomorrow. Thanks again. Bonnet Jonathan.

Hello, I restarted a node today, and I get an error which seems to be related to commit log archiving:

ERROR 14:39:00,435 Exception encountered during startup
java.lang.RuntimeException: java.io.IOException: Cannot run program : error=2, No such file or directory
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.maybeRestoreArchive(CommitLogArchiver.java:172)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:104)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:305)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
Caused by: java.io.IOException: Cannot run program : error=2, No such file or directory
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.cassandra.utils.FBUtilities.exec(FBUtilities.java:588)
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.exec(CommitLogArchiver.java:182)
    at org.apache.cassandra.db.commitlog.CommitLogArchiver.maybeRestoreArchive(CommitLogArchiver.java:168)
    ... 4 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 8 more

No help on the net, and nothing has changed since the last edits to commitlog_archiving.properties. The first time I restarted yesterday there was no problem, and my commit logs were being archived fine. Can someone help me, please? Regards, Bonnet Jonathan.
Re: Try to configure commitlog_archiving.properties
hi Bonnet, that doesn't seem to be a problem with your archiving, but rather with the restoring. What is your restore command? -- artur

On 11/12/13 13:47, Bonnet Jonathan wrote: [...]
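[Editor's note: the empty program name in "Cannot run program :" suggests restore_command (or restore_directories) was set to a blank or whitespace-only value rather than left truly empty. For reference, commitlog_archiving.properties has this general shape; the paths below are illustrative, not Bonnet's actual configuration:]

```properties
# Run for each commit log segment when it is closed.
# %path = fully qualified path of the segment; %name = file name only.
archive_command=/bin/cp %path /backup/commitlog/%name

# Run for each archived segment during restore: %from = archived file,
# %to = destination. Leave truly blank (no stray space) when not restoring.
restore_command=/bin/cp -f %from %to

# Directory scanned for archived segments to replay at startup.
restore_directories=/backup/commitlog

# Replay mutations only up to and including this timestamp.
restore_point_in_time=2013:12:11 23:59:59
```

A stray space after `restore_command=` would make Cassandra try to exec an empty program at startup, which matches the stack trace above.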
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote: Network latency is the reason why the batched query is fastest. One trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue.

While it is true a batch means only one client-server round trip, I'll note that, provided you use the TokenAware load balancing policy, doing the parallelization client side will save you intra-replica round trips, which using a big batch won't. So it might not be all that clear which one is faster. And very large batches have the disadvantage that you are more likely to get a timeout (and if you do, you have to retry the whole batch, even though most of it has probably been inserted correctly). Overall, the best option probably involves parallelizing the inserts of reasonably sized batches, but what the right sizes are is likely very use-case dependent; you'll have to test. -- Sylvain

[...]
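[Editor's note: the client-side parallelization Sylvain recommends amounts to firing all the inserts without waiting on each response, then collecting the futures. With the DataStax Java driver that call is session.executeAsync(ps.bind(...)); since that needs a live cluster, this sketch uses CompletableFuture as a stand-in for the driver call to show the pattern only:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ParallelInserts {
    // Stand-in for session.executeAsync(ps.bind(...)): returns immediately
    // and completes in the background. With the real driver each future
    // would resolve to a ResultSet instead of a String.
    static CompletableFuture<String> executeAsync(String boundValue) {
        return CompletableFuture.supplyAsync(() -> "applied:" + boundValue);
    }

    public static List<String> insertAll(int n) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            futures.add(executeAsync("" + i)); // fire all requests up front
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            results.add(f.join()); // then wait for them all together
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(insertAll(1000).size());
    }
}
```

Unlike one giant batch, a timeout here costs you a retry of a single small insert, not the whole payload.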
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I didn't do any warming up etc. I am new to Cassandra and was just poking around with some scripts to try to find the fastest way to do things. That said, all the mini-tests ran under the same conditions. In our case the batches will have a variable number of different inserts/updates in them, so doing a whole batch as a PreparedStatement won't help. However, using BatchStatement and stuffing it full of repeated PreparedStatements might be better than a batch with inlined parameters. I will do a test of that and see. I will also let the VM warm up and whatnot this time. On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire. The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. But with only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts.
Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes that's what I found. This is faster: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')") Than this: def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[])) This is the fastest option of all (hand rolled batch): StringBuilder b = new StringBuilder() b.append("BEGIN UNLOGGED BATCH\n") for (int i = 0; i < 1000; i++) { b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','").append("aa").append(i).append("')\n") } b.append("APPLY BATCH\n") session.execute(b.toString()) On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: This loop takes 2500ms or so on my test cluster: PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i)); The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters. Do you mean that: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')"); is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain I know I can use batching to insert all the rows at once but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster but still doesn't match inline values.
Composing CQL strings is certainly convenient and simple but is there a much faster way? Thanks David I have also posted this on Stackoverflow if anyone wants the points: http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application -- http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
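The warmup David mentions matters for any micro-benchmark on a JIT-compiled runtime. A minimal harness with an explicit untimed warmup phase might look like this (an illustrative Python sketch; the stub workload stands in for a real `session.execute(...)` call, which would require a live cluster):

```python
import time

def bench(fn, warmup=10_000, iterations=100_000):
    """Run fn `warmup` times untimed, then time the real iterations."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start

# Stub workload standing in for a driver call.
counter = []
elapsed = bench(lambda: counter.append(1))
print(elapsed > 0)
```

With only 1000 timed iterations and no warmup, as in the thread, one-time costs (statement preparation, JIT compilation, connection setup) dominate and the comparison between prepared and inline statements becomes meaningless.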
Re: Try to configure commitlog_archiving.properties
Artur Kronenberg artur.kronenberg at openmarket.com writes: Hi Bonnet, that doesn't seem to be a problem with your archiving, rather with the restoring. What is your restore command? -- artur On 11/12/13 13:47, Bonnet Jonathan. wrote: Thanks for answering so fast. I put nothing for restore; should I? Because I don't want to restore for the moment. Regards,
Data tombstoned during bulk loading 1.2.10 - 2.0.3
Hi all, We're running into a weird problem trying to migrate our data from a 1.2.10 cluster to a 2.0.3 one. I've taken a snapshot on the old cluster, and for each host there, I'm running sstableloader -d <host of new cluster> KEYSPACE/COLUMNFAMILY (the sstableloader process from the 2.0.3 distribution; the one from 1.2.10 only gets java.lang.RuntimeException: java.io.IOException: Connection reset by peer). It then copies the data successfully, but when checking the data I noticed some rows seemed to be missing. It turned out the data is not missing, but has been tombstoned. When I use sstable2json on the sstable on the destination cluster, it has metadata: {"deletionInfo": {"markedForDeleteAt":1796952039620607,"localDeletionTime":0}}, whereas it doesn't have that in the source sstable. (Yes, this is a timestamp far into the future. All our hosts are properly synced through ntp.) This has happened for a bunch of random rows. How is this possible? Naturally, copying the data again doesn't work to fix it, as the tombstone is far in the future. Apart from not having this happen at all, how can it be fixed? Best regards, Mathijs
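The markedForDeleteAt value is a microsecond-precision timestamp, so it can be decoded to see just how far in the future the tombstone sits (a quick sketch, variable names illustrative):

```python
from datetime import datetime, timezone

marked_for_delete_at = 1796952039620607  # microseconds since the Unix epoch
ts = datetime.fromtimestamp(marked_for_delete_at / 1_000_000, tz=timezone.utc)
print(ts.year)  # 2026 -- thirteen years after this 2013 thread
```

Any write stamped before that point is shadowed by the tombstone, which is why re-copying the data cannot resurrect the rows.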
Re: Try to configure commitlog_archiving.properties
So, looking at the code: public void maybeRestoreArchive() { if (Strings.isNullOrEmpty(restoreDirectories)) return; for (String dir : restoreDirectories.split(",")) { File[] files = new File(dir).listFiles(); if (files == null) { throw new RuntimeException("Unable to list directory " + dir); } for (File fromFile : files) { File toFile = new File(DatabaseDescriptor.getCommitLogLocation(), new CommitLogDescriptor(CommitLogSegment.getNextId()).fileName()); String command = restoreCommand.replace("%from", fromFile.getPath()); command = command.replace("%to", toFile.getPath()); try { exec(command); } catch (IOException e) { throw new RuntimeException(e); } } } } I would like someone to confirm this, but it might potentially be a bug. It does the right thing for an empty restore directory; however, it ignores the fact that the restore command could be empty. So for you, Jonathan, I reckon you have the restore directory set? You don't need that to be set in order to archive (only if you want to restore). So set your restore_directory property to empty and you should get rid of those errors; the directory only needs to be set when you enable the restore command. On a second look, I am almost certain this is a bug, as maybeArchive does correctly check that the command is not empty or null; maybeRestore needs to do the same thing for the restoreCommand. If someone confirms, I am happy to raise a bug. cheers, artur On 11/12/13 14:09, Bonnet Jonathan. wrote: Artur Kronenberg artur.kronenberg at openmarket.com writes: Hi Bonnet, that doesn't seem to be a problem with your archiving, rather with the restoring. What is your restore command? -- artur On 11/12/13 13:47, Bonnet Jonathan. wrote: Thanks for answering so fast. I put nothing for restore; should I? Because I don't want to restore for the moment. Regards,
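The fix Artur suggests amounts to guarding on the restore command as well as the restore directories before attempting a restore. A minimal sketch of that guard logic (Python, with hypothetical names mirroring the quoted Java; the real fix would of course go into the Java method itself):

```python
def should_restore(restore_directories, restore_command):
    """Skip restore when either the directories or the command is unset,
    mirroring the null/empty check that maybeArchive already performs."""
    if not restore_directories or not restore_directories.strip():
        return False
    if not restore_command or not restore_command.strip():
        return False
    return True

print(should_restore("", "cp %from %to"))          # False: nothing to restore from
print(should_restore("/backups", ""))              # False: the case the quoted code misses
print(should_restore("/backups", "cp %from %to"))  # True: both are configured
```

The second case is exactly Jonathan's situation: a restore directory configured but no restore command, which the quoted maybeRestoreArchive does not handle.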
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
Very good point. I've written code to do a very large number of inserts, but I've only ever run it on a single-node cluster. I may very well find out when I run it against a multinode cluster that the performance benefits of large unlogged batches mostly go away. From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Wednesday, December 11, 2013 at 6:52 AM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application? On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille rwi...@fold3.com wrote: Network latency is the reason why the batched query is fastest. One trip to Cassandra versus 1000. If you execute the inserts in parallel, then that eliminates the latency issue. While it is true a batch means only one client-server round trip, I'll note that, provided you use the TokenAware load balancing policy, doing the parallelization client side will save you intra-replica round-trips, which using a big batch won't. So it might not be all that clear which one is faster. And very large batches have the disadvantage that you are more likely to get a timeout (and if you do, you have to retry the whole batch, even though most of it has probably been inserted correctly). Overall, the best option probably involves parallelizing the inserts of reasonably sized batches, but what the right sizes are is likely very use-case dependent; you'll have to test. -- Sylvain From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Wednesday, December 11, 2013 at 5:40 AM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: What is the fastest way to get data into Cassandra 2 from a Java application? Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general. They save some parsing and some bytes on the wire.
The savings will tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements being significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug). Are you sure you are warming up the JVMs (client and drivers) properly, for instance? 1000 iterations is *really small*; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application lifetime. But with only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time. But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts. Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelizing). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync. -- Sylvain On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote: Yes that's what I found.
This is faster: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')") Than this: def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[])) This is the fastest option of all (hand rolled batch): StringBuilder b = new StringBuilder() b.append("BEGIN UNLOGGED BATCH\n") for (int i = 0; i < 1000; i++) { b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('").append(i).append("','").append("aa").append(i).append("')\n") } b.append("APPLY BATCH\n") session.execute(b.toString()) On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote: This loop takes 2500ms or so on my test cluster: PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)") for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i)); The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters. Do you mean that: for (int i = 0; i < 1000; i++) session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')"); is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected; are you sure you're not re-preparing the statement every time in the loop? -- Sylvain I know I can use
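The hand-rolled batch from the quoted code can be sketched in Python to show the statement it actually produces (keyspace and table names as in the thread; this only builds the CQL string, it does not talk to a cluster):

```python
def build_unlogged_batch(ks, n):
    """Concatenate n INSERTs into one UNLOGGED BATCH statement,
    like the StringBuilder loop in the quoted Groovy/Java code."""
    parts = ["BEGIN UNLOGGED BATCH"]
    for i in range(n):
        parts.append(
            "INSERT INTO {}.wibble (id, info) VALUES ('{}', 'aa{}')".format(ks, i, i))
    parts.append("APPLY BATCH")
    return "\n".join(parts)

stmt = build_unlogged_batch("test", 3)
print(stmt.splitlines()[0])   # BEGIN UNLOGGED BATCH
print(stmt.splitlines()[-1])  # APPLY BATCH
```

As Sylvain notes later in the thread, the whole statement travels in one request, so a timeout forces a retry of everything in it; reasonably sized batches issued in parallel are usually the safer middle ground.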
Re: Try to configure commitlog_archiving.properties
Thanks Artur, you're right, I must comment out the restore directory too. Now I'll try to practice around restore. Regards, Bonnet Jonathan.
Re: How to create counter column family via Pycassa?
What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Re: How to create counter column family via Pycassa?
What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
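Putting Tyler's points together, a fuller set of keyword arguments might look like this. This is a sketch, not run against a live cluster: the 'users' schema is illustrative, and the commented-out call assumes a connected pycassa SystemManager as described above. It also shows the distinction from the quoted thread between passing a dict as one argument and unpacking it with **:

```python
# Types are given as Cassandra comparator/validator class names.
# Illustrative schema for a hypothetical 'users' column family:
cf_kwargs = {
    "comparator_type": "UTF8Type",           # type of the column *names*
    "key_validation_class": "UTF8Type",      # type of the row keys
    "default_validation_class": "UTF8Type",  # any column not listed below
    "column_validation_classes": {           # per-column value types
        "username": "UTF8Type",
        "age": "IntegerType",
    },
}

# With a live SystemManager, the **-unpacking form (the "#2" that worked
# in the quoted thread) would be:
# sys.create_column_family('mykeyspace', 'users', **cf_kwargs)

print(sorted(cf_kwargs["column_validation_classes"]))  # ['age', 'username']
```

Passing `column_validation_classes=cf_kwargs` (the "#1" form) fails because it hands the whole options dict to a parameter that expects only a column-name-to-type mapping.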
[no subject]
Hey Folks, so I am creating a column family using pycassaShell. See below: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) I am getting this error: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') My data will look like this: 'row_key' : { 'tid' : { 'expanded_url': u'http://instagram.com/p/hwDj2BJeBy/', 'text': '#snowinginNYC Makes me so happy\xe2\x9d\x840brittles0 \xe2\x9b\x84 @ Grumman Studios http://t.co/rlOvaYSfKa', 'profile_image': u' https://pbs.twimg.com/profile_images/3262070059/1e82f895559b904945d28cd3ab3947e5_normal.jpeg ', 'tuid': 339322611, 'approved': 'true', 'favorite_count': 0, 'screen_name': u'LonaVigi', 'created_at': u'Wed Dec 11 01:10:05 + 2013', 'embedly_data': {u'provider_url': u'http://instagram.com/', u'description': u"lonavigi's photo on Instagram", u'title': u'#snwinginNYC Makes me so happy\u2744@0brittles0 \u26c4', u'url': u' http://distilleryimage7.ak.instagram.com/5b880dec61c711e3a50b129314edd3b_8.jpg', u'thumbnail_width': 640, u'height': 640, u'width': 640, u'thumbnail_url': u' http://distilleryimage7.ak.instagram.com/b880dec61c711e3a50b1293d14edd3b_8.jpg', u'author_name': u'lonavigi', u'version': u'1.0', u'provider_name': u'Instagram', u'type': u'poto', u'thumbnail_height': 640, u'author_url': u' http://instagram.com/lonavigi'}, 'tid': 410577192746500096, 'retweet_count': 0 } }
Re: Cyclop - CQL3 web based editor
Hi Maciej, This looks great! Thanks for building this. On Wed, Dec 11, 2013 at 12:45 AM, Murali muralidharan@gmail.com wrote: Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send an email to this group. It is a web based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web - without the need to install anything. The editor itself supports code completion, based not only on CQL syntax but also on database content - so for example the select statement will suggest tables from the active keyspace, and the where clause only columns from the table provided after select from. The results are displayed in a reversed table - rows horizontally and columns vertically. It seems more natural for a column-oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on wicket + bootstrap + spring and can be deployed in any web 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej -- Thanks, Murali 99025-5 -- Best, Parth
Re: How to create counter column family via Pycassa?
validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) throws: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') Can you please explain why? On Wed, Dec 11, 2013 at 12:08 PM, Tyler Hobbs ty...@datastax.com wrote: What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs? SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply.
Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Re: How to create counter column family via Pycassa?
This works when I remove comparator_type: validators = { 'tid': 'IntegerType', 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'BytesType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) On Wed, Dec 11, 2013 at 12:23 PM, Kumar Ranjan winnerd...@gmail.com wrote: I am using ccm cassandra version *1.2.11* On Wed, Dec 11, 2013 at 12:19 PM, Kumar Ranjan winnerd...@gmail.com wrote: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) throws: *InvalidRequestException*: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.') Can you please explain why? On Wed, Dec 11, 2013 at 12:08 PM, Tyler Hobbs ty...@datastax.com wrote: What options are available depends on what version of Cassandra you're using. You can specify the row key type with 'key_validation_class'. For column types, use 'column_validation_classes', which is a dict mapping column names to types. For example: sys.create_column_family('mykeyspace', 'users', column_validation_classes={'username': UTF8Type, 'age': IntegerType}) On Wed, Dec 11, 2013 at 10:32 AM, Kumar Ranjan winnerd...@gmail.com wrote: What are all the possible values for cf_kwargs?
SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type=UTF8Type, ) - Here I want to specify column data types and the row key type. How can I do that? On Thu, Aug 15, 2013 at 12:30 PM, Tyler Hobbs ty...@datastax.com wrote: The column_validation_classes arg is just for defining individual column types. Glad you got it figured out, though. On Thu, Aug 15, 2013 at 11:23 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Thanks for the quick reply. Apparently, I was trying to get this working: cf_kwargs = {'default_validation_class':COUNTER_COLUMN_TYPE} sys.create_column_family('my_ks', 'vote_count', column_validation_classes=cf_kwargs) #1 But this works: sys.create_column_family('my_ks', 'vote_count', **cf_kwargs) #2 I thought #1 should work. On Thu, Aug 15, 2013 at 9:15 PM, Tyler Hobbs ty...@datastax.com wrote: The only thing that makes a CF a counter CF is that the default validation class is CounterColumnType, which you can set through SystemManager.create_column_family(). On Thu, Aug 15, 2013 at 10:38 AM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I do not find a way to create a counter column family in Pycassa. This[1] does not help. Appreciate it if someone can help me. Thanks 1. http://pycassa.github.io/pycassa/api/pycassa/system_manager.html#pycassa.system_manager.SystemManager.create_column_family -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Bulkoutputformat
Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in hadoop. Is it the best way, or is using the driver and doing batch inserts the better way? Are there any disadvantages to using BulkOutputFormat? Thanks for helping Varun
efficient way to store 8-bit or 16-bit value?
What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
Re: efficient way to store 8-bit or 16-bit value?
Column metadata is about 20 bytes, so there is no big difference whether you save 1 or 4 bytes. Thank you, Andrey On Wed, Dec 11, 2013 at 2:42 PM, onlinespending onlinespend...@gmail.com wrote: What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
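For the original question, a one-byte value can indeed be sent as a one-byte blob; a sketch of the client-side packing using Python's struct module shows the size difference Andrey is dismissing (the ~20-byte per-column overhead figure comes from the reply above and is era-specific):

```python
import struct

value = 200                          # fits in 8 bits
one_byte = struct.pack(">B", value)  # a 1-byte blob
as_int = struct.pack(">i", value)    # what a 32-bit int column would carry
print(len(one_byte), len(as_int))    # 1 4
# Saving 3 bytes per column is dwarfed by the per-column metadata
# overhead (~20 bytes in this era of Cassandra), hence the advice.
```

In other words, the choice between a 1-byte blob and a 32-bit int is mostly a matter of convenience, not storage efficiency.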
Re: nodetool repair keeping an empty cluster busy
On Wed, Dec 11, 2013 at 1:35 AM, Sven Stark sven.st...@m-square.com.au wrote: thanks for replying. Could you please be a bit more specific, though? E.g. what exactly is being compacted - there is/was no data at all in the cluster save for a few hundred kB in the system CF (see the nodetool status output). Or: how can those few hundred kB of data generate GBs of network traffic? The only answer I can come up with is that the Merkle trees generated and compared by repair are of a fixed size, and don't scale with the data present in the cluster. While I'm pretty sure each node can be aware that it has little to no data to repair, it generates and compares the trees anyway. It's a bit surprising that this might be GBs of network traffic... The system keyspace will always have some data in it; have you tried only compacting your empty keyspace instead of the whole node? If so, and it exhibits the same behavior, that seems like a bug or at least unexpected behavior to me. If you're running a modern version of Cassandra, I would file a JIRA. =Rob
Re: AddContactPoint / VIP
What is the good practice to put in the code as addContactPoint, i.e., how many servers? I use the same nodes as the seed list nodes for that DC. The idea of the seed list is that it's a list of well known nodes, and it's easier operationally to say we have one list of well known nodes that is used by the servers and the clients. 1) I am also thinking to put it this way (here I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. Too complicated. 2) Is it a problem if I use multiple data centers in future? You only need to give the client the local seeds; it will discover all the nodes. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 7:12 am, chandra Varahala hadoopandcassan...@gmail.com wrote: Greetings, I have a 4 node cassandra cluster that will grow up to 10 nodes; we are using the CQL Java client to access the data. What is the good practice to put in the code as addContactPoint, i.e., how many servers? 1) I am also thinking to put it this way (here I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. 2) Is it a problem if I use multiple data centers in future? thanks Chandra
Re: Write performance with 1.2.12
Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency? If it's write latency it's probably GC; if it's read, it's probably IO or the data model. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It is defaulting to 1/3 of the heap, which is 8/3 ~ 2.6 GB in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 GB to the disk might slow the performance if frequently called; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations, which are in cassandra.yaml. Apologies, I was tuning the JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 GB heap. Are these spikes causing any issue at your end? There are no big spikes; the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing the cassandra configuration.
The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU, I had similar behavior with my setup. Were you able to get around it? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the ssd (for commit log) and sdb is the spinner.
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00   0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.24   9.80   0.13   0.32
dm-4       0.00    0.00  0.00   6.60    0.00   52.80      8.00      0.01   1.36   0.55   0.36
dm-5       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-6       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.29  11.60   0.13   0.32

I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ?
Re: OOMs during high (read?) load in Cassandra 1.2.11
Do you have the back trace from the heap dump so we can see what the array was and what was using it? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 4:41 am, Klaus Brunner klaus.brun...@gmail.com wrote: 2013/12/9 Nate McCall n...@thelastpickle.com: Do you have any secondary indexes defined in the schema? That could lead to a 'mega row' pretty easily depending on the cardinality of the value. That's an interesting point - but no, we don't have any secondary indexes anywhere. From the heap dump, it's fairly evident that it's not a single huge row but actually many rows. I'll keep watching if this occurs again, or if the compaction fixed it for good. Thanks, Klaus
Re: Data Modelling Information
create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) This statement is syntactically invalid, also you cannot use a collection type in the primary key. 1) I should be able to query by username and get all the messages for a particular username yes. 2) I should be able to query by tags and username (like select * from messages where username='xya' and tags in ('awesome','phone')) No. 3) I should be able to query all messages by day and order by desc and limit to some value No. Could you guys please let me know if creating a secondary index on the tags field would work? No, it’s not supported. Or what would be the best way to model this data. You need to describe the problem and how you want to read the data. I suggest taking a look at the data modelling videos from Patrick here http://planetcassandra.org/Learn/CassandraCommunityWebinars Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 8:57 am, Shrikar archak shrika...@gmail.com wrote: Hi Data Model Experts, I have a few questions with data modelling for a particular application. example create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) Requirements 1) I should be able to query by username and get all the messages for a particular username 2) I should be able to query by tags and username (like select * from messages where username='xya' and tags in ('awesome','phone')) 3) I should be able to query all messages by day and order by desc and limit to some value Could you guys please let me know if creating a secondary index on the tags field would work? Or what would be the best way to model this data. Thanks, Shrikar
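Following the advice above to model around the read paths, one common shape is a table per query with the application denormalizing on write. The schema below is only a sketch of that idea — the table and column names are made up, not from the thread, and the tag query is served by writing one row per tag:

```sql
-- Sketch only: hypothetical names, one table per read path.

-- Query 1: all messages for a username, newest first.
CREATE TABLE messages_by_user (
    username text,
    created timeuuid,
    body text,
    PRIMARY KEY (username, created)
) WITH CLUSTERING ORDER BY (created DESC);

-- Query 2: messages for a username filtered by tag.
-- The application inserts one copy of the message per tag.
CREATE TABLE messages_by_user_tag (
    username text,
    tag text,
    created timeuuid,
    body text,
    PRIMARY KEY ((username, tag), created)
) WITH CLUSTERING ORDER BY (created DESC);

-- Query 3: messages by day, newest first (day string as the partition key).
CREATE TABLE messages_by_day (
    day text,          -- e.g. '2013-12-11'
    created timeuuid,
    body text,
    PRIMARY KEY (day, created)
) WITH CLUSTERING ORDER BY (created DESC);
```

With this shape, a multi-tag lookup such as `tags in ('awesome','phone')` becomes one SELECT per tag against `messages_by_user_tag`, merged client-side.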
Re: Write performance with 1.2.12
Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.comwrote: Changed memtable_total_space_in_mb to 1024 still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. You are right, had to revert it back to default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency ? If it’s write latency it’s probably GC, if it’s read is probably IO or data model. It is the write latency, read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases, if I stop traffic to other nodes the latency drops again, I checked, this is not node specific it happens to any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024 still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value, it is defaulting to 1/3 which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called, may be you have lots of write operations going on. 
On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. 
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00   0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.24   9.80   0.13   0.32
dm-4       0.00    0.00  0.00   6.60    0.00   52.80
Re: Nodetool repair exceptions in Cassandra 2.0.2
[2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 The 10.x.x.48 node sent a tree response (merkle tree) to this node that did not contain the tree. This node then killed the repair session. Look for log messages on 10.x.x.48 that correlate with the repair session ID above. They may look like logger.error("Failed creating a merkle tree for " + desc + ", " + initiator + " (see log for details)"); or logger.info(String.format("[repair #%s] Sending completed merkle tree to %s for %s/%s", desc.sessionId, initiator, desc.keyspace, desc.columnFamily)); Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 12:57 pm, Laing, Michael michael.la...@nytimes.com wrote: My experience is that you must upgrade to 2.0.3 ASAP to fix this. Michael On Mon, Dec 9, 2013 at 6:39 PM, David Laube d...@stormpath.com wrote: Hi All, We are running Cassandra 2.0.2 and have recently stumbled upon an issue with nodetool repair.
Upon running nodetool repair on each of the 5 nodes in the ring (one at a time) we observe the following exceptions returned to standard out; [2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 [2013-12-08 11:04:02,063] Repair session 284c8b40-5ff8-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #284c8b40-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,070] Repair session 399e7160-5ff8-11e3-97c0-5973cc397f8f for range (8901153810410866970,8915879751739915956] failed with error org.apache.cassandra.exceptions.RepairException: [repair #399e7160-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (8901153810410866970,8915879751739915956]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,072] Repair session 3ea73340-5ff8-11e3-97c0-5973cc397f8f for range (1149084504576970235,1190026362216198862] failed with error org.apache.cassandra.exceptions.RepairException: [repair #3ea73340-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1149084504576970235,1190026362216198862]] Validation failed in /10.x.x.103 [2013-12-08 11:04:02,091] Repair session 6f0da460-5ff8-11e3-97c0-5973cc397f8f for range (-5407189524618266750,-5389231566389960750] failed with error org.apache.cassandra.exceptions.RepairException: [repair #6f0da460-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (-5407189524618266750,-5389231566389960750]] Validation failed in /10.x.x.103 [2013-12-09 23:16:36,962] Repair session 7efc2740-6127-11e3-97c0-5973cc397f8f for range 
(1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #7efc2740-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48 [2013-12-09 23:16:36,986] Repair session a8c44260-6127-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #a8c44260-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.210 The /var/log/cassandra/system.log shows similar info as above with no real explanation as to the root cause behind the exception(s). There also does not appear to be any additional info in /var/log/cassandra/cassandra.log. We have tried restoring a recent snapshot of the keyespace in question to a separate staging ring and the repair runs successfully and without exception there. This is even after we tried insert/delete on the keyspace in the separate staging ring. Has anyone seen this behavior before and what can we do to resolve this? Any assistance would be greatly appreciated. Best regards, -Dave
Re: setting PIG_INPUT_INITIAL_ADDRESS environment . variable in Oozie for cassandra ...¿?
Caused by: java.io.IOException: PIG_INPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setLocation(CassandraStorage.java:314) at org.apache.cassandra.hadoop.pig.CassandraStorage.getSchema(CassandraStorage.java:358) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151) ... 35 more Have you checked these are set? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 4:00 am, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: Hi, I have an error with the pig action in oozie 4.0.0 using CassandraStorage. (cassandra 1.2.10) I can run pig scripts fine with cassandra, but when I try to use CassandraStorage to load data I get this error: Run pig script using PigRunner.run() for Pig version 0.8+ Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 Run pig script using PigRunner.run() for Pig version 0.8+ 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,501 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:155) at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:110) at org.apache.pig.newplan.logical.relational.LOStore.getSchema(LOStore.java:68) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.validate(SchemaAliasVisitor.java:60) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.visit(SchemaAliasVisitor.java:84) at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1617) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1611) at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1334) at 
org.apache.pig.PigServer.execute(PigServer.java:1239) at org.apache.pig.PigServer.executeBatch(PigServer.java:362) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:430) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at
Re: Exactly one wide row per node for a given CF?
Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. As the number of columns in a row increases so does the size of the column index which is read as part of the read path. For background and comparisons of latency see http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html or my talk on performance at the SF summit last year http://thelastpickle.com/speaking/2012/08/08/Cassandra-Summit-SF.html While the column index has been lifted to the -Index.db component AFAIK it must still be fully loaded. Larger rows take longer to go through compaction, tend to cause more JVM GC and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync we will create a new copy of that row on the node, which must then be compacted. I’ve seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings. IMHO all things being equal rows in the few 10’s of MB work better. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 2:41 am, Robert Wille rwi...@fold3.com wrote: I have a question about this statement: When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. I tested a data model that I created. Here’s the schema for the table in question: CREATE TABLE bdn_index_pub ( tree INT, pord INT, hpath VARCHAR, PRIMARY KEY (tree, pord) ); As a test, I inserted 100 million records. tree had the same value for every record, and I had 100 million values for pord. hpath averaged about 50 characters in length.
My understanding is that all 100 million strings would have been stored in a single row, since they all had the same value for the first component of the primary key. I didn’t look at the size of the table, but it had to be several gigs (uncompressed). Contrary to what Aaron says, I do want to know what happens, because I didn’t experience any issues with this table during my test. Inserting was fast. The last batch of records inserted in approximately the same amount of time as the first batch. Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. If this is bad, can somebody suggest a better pattern? This table was designed to support a query like this: select hpath from bdn_index_pub where tree = :tree and pord >= :start and pord <= :end. In my application, most trees will have less than a million records. A handful will have 10’s of millions, and one of them will have 100 million. If I need to break up my rows, my first instinct would be to divide each tree into blocks of say 10,000 and change tree to a string that contains the tree and the block number. Something like this: 17:0, 0, ‘/’ … 17:0, , ’/a/b/c’ 17:1,1, ‘/a/b/d’ … I’d then need to issue an extra query for ranges that crossed block boundaries. Any suggestions on a better pattern? Thanks Robert From: Aaron Morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 10, 2013 at 12:33 AM To: Cassandra User user@cassandra.apache.org Subject: Re: Exactly one wide row per node for a given CF? But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). This is exactly the problem consistent hashing (used by cassandra) is designed to solve. If you hash the key and modulo the number of nodes, adding and removing nodes requires a lot of data to move.
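Robert's block idea can be sketched in a few lines. This is a hedged illustration only — the bucket size and the `tree:block` key format are assumptions drawn from his "blocks of say 10,000" suggestion, not a tested design:

```python
BLOCK = 10000  # assumed bucket size, per the "blocks of say 10,000" idea

def partition_key(tree, pord):
    """Composite partition key '<tree>:<block>': caps each row at BLOCK entries."""
    return "%d:%d" % (tree, pord // BLOCK)

def blocks_for_range(tree, start, end):
    """Partition keys a range query [start, end] must touch (one query per key)."""
    return ["%d:%d" % (tree, b) for b in range(start // BLOCK, end // BLOCK + 1)]
```

A range that crosses a block boundary simply becomes one query per returned key, which is the "extra query" Robert anticipates.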
I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. Sounds like you should revisit your data modelling, this is a pretty well known anti pattern. When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. It’s a bad idea and you should take another look at the data model. If you have to do it, you can try the ByteOrderedPartitioner which uses the row key as a token, giving you total control of the row placement. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 8:32 pm, Vivek Mishra
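The point about `id % N` versus consistent hashing can be checked numerically. The toy model below is not Cassandra's actual partitioner code (Cassandra uses RandomPartitioner/Murmur3 with per-node tokens); it just compares how many keys change owner when a 4-node cluster grows to 5 under each scheme:

```python
import bisect
import hashlib

def h(s):
    # Deterministic hash; Python's builtin hash() is randomized per process.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def owner_mod(key, n):
    # "id % N" placement: most keys change owner whenever N changes.
    return h(key) % n

def build_ring(num_nodes, vnodes=64):
    # Sorted (token, node) pairs: each node claims several points on a 2**32 ring.
    return sorted((h("%d-%d" % (n, v)) % 2**32, n)
                  for n in range(num_nodes) for v in range(vnodes))

def owner_ring(key, ring):
    # Walk clockwise from the key's point to the next token; wrap at the end.
    points = [t for t, _ in ring]
    i = bisect.bisect(points, h(key) % 2**32) % len(ring)
    return ring[i][1]

keys = ["key%d" % i for i in range(10000)]
moved_mod = sum(owner_mod(k, 4) != owner_mod(k, 5) for k in keys) / len(keys)
r4, r5 = build_ring(4), build_ring(5)
moved_ring = sum(owner_ring(k, r4) != owner_ring(k, r5) for k in keys) / len(keys)
# Roughly 80% of keys move under modulo, but only about 1/5 on the ring.
```

This is exactly why Aaron calls modulo placement troublesome: growing the cluster by one node relocates the large majority of the data, while on a token ring only the share claimed by the new node moves.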
user / password authentication advice
Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using the secure token (e.g. a cookie). Obviously I’m only scratching the surface and it’s the detail and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
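Not Cassandra-specific, but the usual baseline for the "verified against a table" step is a salted, deliberately slow hash rather than encryption. A minimal sketch with Python's standard-library PBKDF2 — the iteration count and salt size here are illustrative choices, not a recommendation tuned for any particular deployment:

```python
import hashlib
import hmac
import os

ITERATIONS = 100000  # illustrative; tune to your hardware

def hash_password(password, salt=None):
    """Return (salt, digest); store both in the users table, never the plaintext."""
    salt = salt or os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)
```

On a successful verify, the application would then mint a random session token (e.g. `os.urandom`), store it server-side with an expiry, and hand it to the browser as the cookie Ben describes.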
Re: Data tombstoned during bulk loading 1.2.10 - 2.0.3
On Wed, Dec 11, 2013 at 6:27 AM, Mathijs Vogelzang math...@apptornado.comwrote: When I use sstable2json on the sstable on the destination cluster, it has metadata: {deletionInfo: {markedForDeleteAt:1796952039620607,localDeletionTime:0}}, whereas it doesn't have that in the source sstable. (Yes, this is a timestamp far into the future. All our hosts are properly synced through ntp). This seems like a bug in sstableloader, I would report it on JIRA. Naturally, copying the data again doesn't work to fix it, as the tombstone is far in the future. Apart from not having this happen at all, how can it be fixed? Briefly, you'll want to purge that tombstone and then reload the data with a reasonable timestamp. Dealing with rows with data (and tombstones) in the far future is described in detail here : http://thelastpickle.com/blog/2011/12/15/Anatomy-of-a-Cassandra-Partition.html =Rob
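The "copying the data again doesn't work" part follows from Cassandra's reconciliation rule: the highest write timestamp wins, and a cell is shadowed by a row tombstone unless it was written after markedForDeleteAt. A toy comparison makes it concrete (the reload timestamp is an illustrative present-day value, not from the thread):

```python
def is_visible(cell_ts, marked_for_delete_at):
    # Cassandra keeps a cell under a row tombstone only if the cell's
    # write timestamp is strictly newer than the deletion timestamp.
    return cell_ts > marked_for_delete_at

tombstone_ts = 1796952039620607  # microseconds, ~2026 -- the far-future value from the thread
reload_ts = 1386720000000000     # illustrative: ~11 Dec 2013 in microseconds

# Re-streamed data carries a present-day timestamp, so it stays shadowed:
assert not is_visible(reload_ts, tombstone_ts)
```

Hence Rob's advice: the tombstone itself has to be purged (or the data rewritten with a timestamp beyond it, which only compounds the problem) before a normal reload can become visible again.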
Re:
SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) CompositeType is a type composed of other types, see http://pycassa.github.io/pycassa/assorted/composite_types.html?highlight=compositetype Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:15 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, So I am creating, column family using pycassaShell. See below: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) I am getting this error: InvalidRequestException: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.' 
My data will look like this: 'row_key' : { 'tid' : { 'expanded_url': u'http://instagram.com/p/hwDj2BJeBy/', 'text': '#snowinginNYC Makes me so happy\xe2\x9d\x840brittles0 \xe2\x9b\x84 @ Grumman Studios http://t.co/rlOvaYSfKa', 'profile_image': u'https://pbs.twimg.com/profile_images/3262070059/1e82f895559b904945d28cd3ab3947e5_normal.jpeg', 'tuid': 339322611, 'approved': 'true', 'favorite_count': 0, 'screen_name': u'LonaVigi', 'created_at': u'Wed Dec 11 01:10:05 + 2013', 'embedly_data': {u'provider_url': u'http://instagram.com/', u'description': ulonavigi's photo on Instagram, u'title': u'#snwinginNYC Makes me so happy\u2744@0brittles0 \u26c4', u'url': u'http://distilleryimage7.ak.instagram.com/5b880dec61c711e3a50b129314edd3b_8.jpg', u'thumbnail_width': 640, u'height': 640, u'width': 640, u'thumbnail_url': u'http://distilleryimage7.ak.instagram.com/b880dec61c711e3a50b1293d14edd3b_8.jpg', u'author_name': u'lonavigi', u'version': u'1.0', u'provider_name': u'Instagram', u'type': u'poto', u'thumbnail_height': 640, u'author_url': u'http://instagram.com/lonavigi'}, 'tid': 410577192746500096, 'retweet_count': 0 } }
Re: Cyclop - CQL3 web based editor
thanks, looks handy. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:16 am, Parth Patil parthpa...@gmail.com wrote: Hi Maciej, This looks great! Thanks for building this. On Wed, Dec 11, 2013 at 12:45 AM, Murali muralidharan@gmail.com wrote: Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send email to this group. This is a web based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web - without the need to install anything. The editor itself supports code completion, based not only on CQL syntax but also on database content - so for example the select statement will suggest tables from the active keyspace, and the where clause suggests only columns from the table named in the from clause. The results are displayed in a reversed table - rows horizontally and columns vertically. It seems to be more natural for a column oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on wicket + bootstrap + spring and can be deployed in any web 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej -- Thanks, Murali 99025-5 -- Best, Parth
Re: CLUSTERING ORDER CQL3
You need to specify all the clustering key components in the CLUSTERING ORDER BY clause create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (cid ASC, ts DESC); cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 10:44 am, Shrikar archak shrika...@gmail.com wrote: Hi All, My Usecase: I want query results ordered by timestamp DESC. But I don't want timestamp to be the second column in the primary key as that will take away my querying capability, for example create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts DESC); Queries required: I want the result for all the below queries to be in DESC order of timestamp select * from demo where oid = 100; select * from demo where oid = 100 and cid = 10; select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); I am trying to create this table with CLUSTERING ORDER in CQL and getting this error cqlsh:viralheat> create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts desc); Bad Request: Missing CLUSTERING ORDER for column cid In this document it mentions that we can have multiple keys for cluster ordering. Anyone know how to do that? Go here Datastax doc If I make the timestamp the second column then I can't have queries like select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); Thanks, Shrikar
Re: Bulkoutputformat
If you don’t need to use Hadoop then try the SSTableSimpleWriter and sstableloader; this post is a little old but still relevant http://www.datastax.com/dev/blog/bulk-loading Otherwise AFAIK BulkOutputFormat is what you want from hadoop http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com wrote: Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in hadoop. Is it the best way, or is using the driver and doing batch inserts better? Are there any disadvantages of using BulkOutputFormat? Thanks for helping Varun
Re: efficient way to store 8-bit or 16-bit value?
What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. blob is a byte array or you could use the varint, a variable length integer, but you probably want the blob. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 1:33 pm, Andrey Ilinykh ailin...@gmail.com wrote: Column metadata is about 20 bytes. So, there is no big difference if you save 1 or 4 bytes. Thank you, Andrey On Wed, Dec 11, 2013 at 2:42 PM, onlinespending onlinespend...@gmail.com wrote: What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
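To make the blob suggestion concrete: on the client side an 8- or 16-bit value really does serialize to one or two bytes before it reaches Cassandra. A small sketch with Python's struct module (the per-cell overhead Andrey mentions is added by the storage engine on top of this payload):

```python
import struct

# Pack small values into the smallest possible blob payload.
one_byte = struct.pack("B", 0x2A)     # unsigned 8-bit value
two_bytes = struct.pack(">H", 40000)  # unsigned 16-bit value, big-endian

# The blob payloads are exactly 1 and 2 bytes, and round-trip cleanly.
```

Since the cell metadata dwarfs the payload either way, the choice between a 1-byte blob and a 4-byte int is mostly about schema clarity rather than space.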
Re: Write performance with 1.2.12
It is the write latency; read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node-specific; it happens on any node. Is this the local write latency or the cluster-wide write request latency? What sort of numbers are you seeing? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 3:39 pm, srmore comom...@gmail.com wrote: Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more work for compaction and result in increased IO. You should return it to the default. You are right, had to revert it back to default. When I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency? If it’s write latency it’s probably GC; if it’s read it’s probably IO or data model. It is the write latency; read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node-specific; it happens on any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps.
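Aaron's distinction between local write latency and cluster-wide request latency can be checked directly with nodetool on a 1.2.x node; a quick sketch (the host, keyspace, and column family names here are examples):

```shell
# Coordinator-level (cluster-wide) read/write request latency histograms:
nodetool -h node1 proxyhistograms

# Local read/write latency for a specific column family on this node:
nodetool -h node1 cfhistograms mykeyspace mycolumnfamily
```

If proxyhistograms rises while cfhistograms stays flat, the extra time is spent in inter-node coordination (replication to other replicas at LOCAL_QUORUM) rather than in the local write path — consistent with latency increasing only when traffic goes to all nodes.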
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It defaults to 1/3 of the heap, which is 8/3 ≈ 2.6 GB here: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management Flushing 2.6 GB to disk might slow performance if it happens frequently; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configuration and not the Cassandra configuration, which is in cassandra.yaml. Apologies, was tuning the JVM and that's what was in my mind. Here are the Cassandra settings: http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 GB heap. Are these spikes causing any issue at your end? There are no big spikes; the overall performance seems to be about 40% lower. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing the Cassandra configuration. The Cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU; I had similar behavior with my setup. Were you able to get around it? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running Cassandra 1.2.12. They are pretty big machines, 64 GB RAM with 16 cores; the Cassandra heap is 8 GB.
The interesting observation is that when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip, but not half as seen with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement, but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the SSD (for the commit log) and sdb is the spinner.

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00  4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00  0.00    0.00    0.00
Re: user / password authentication advice
Not sure if you are asking about authentication/authorisation in Cassandra, or how to implement the same using Cassandra. Info on Cassandra authentication and authorisation is here: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/security/securityTOC.html Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 4:31 pm, onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows:
- the user types in a username (e.g. email address) and password
- this is verified against a table storing usernames and passwords (encrypted in some way)
- a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie)
Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
Re: user / password authentication advice
Hi! You're right, this isn't really Cassandra-specific. Most languages/web frameworks have their own way of doing user authentication, and then you typically write a plugin that stores whatever data the system needs in Cassandra. For example, if you're using Java (or Scala or Groovy or anything else JVM-based), Apache Shiro is a good way of doing user authentication and authorization: http://shiro.apache.org/. Just implement a custom Realm for Cassandra and you should be set. /Janne On Dec 12, 2013, at 05:31 , onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie). Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
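Whichever framework handles the sessions, the storage side of the steps Ben lists usually reduces to: hash the password with a salt on registration, recompute and compare on login, then hand back an opaque random token. A minimal sketch of that logic in Python, standard library only — the function names are made up and the Cassandra reads/writes are deliberately left out:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted PBKDF2 hash; store (salt, digest) alongside the username."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

def issue_token():
    """Opaque random session token to hand back to the browser (e.g. in a cookie)."""
    return os.urandom(24).hex()

# Simulated flow: register, then attempt two logins.
salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))  # correct password
print(verify_password("wrong", salt, stored))   # wrong password
token = issue_token()
```

The point of PBKDF2 (or bcrypt/scrypt) over a plain hash is that the iteration count makes brute-forcing a leaked table expensive; Shiro's credential matchers implement the same idea on the JVM side.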
Re: user / password authentication advice
OK, thanks for getting me going in the right direction. I imagine most people would store password and tokenized authentication information in a single table, using the username (e.g. email address) as the key? On Dec 11, 2013, at 10:44 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! You're right, this isn't really Cassandra-specific. Most languages/web frameworks have their own way of doing user authentication, and then you typically write a plugin that stores whatever data the system needs in Cassandra. For example, if you're using Java (or Scala or Groovy or anything else JVM-based), Apache Shiro is a good way of doing user authentication and authorization: http://shiro.apache.org/. Just implement a custom Realm for Cassandra and you should be set. /Janne On Dec 12, 2013, at 05:31 , onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs discussing common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows: the user types in a username (e.g. email address) and password; this is verified against a table storing usernames and passwords (encrypted in some way); a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie). Obviously I’m only scratching the surface, and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
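Following Ben's single-table idea, one possible CQL3 layout keyed by email could look like the sketch below. All names and types here are illustrative, and whether tokens live in the same row or in a separate table (so they can expire independently via TTL) is a design choice the thread leaves open:

```
CREATE TABLE users (
    email text PRIMARY KEY,   -- username doubles as the partition key
    salt blob,
    password_hash blob,       -- e.g. PBKDF2/bcrypt output, never plaintext
    auth_token text
);
```

A separate auth_tokens table keyed by the token itself, written USING TTL, would let session lookups go straight from cookie to user without touching the password row.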