Re: cql query

2013-05-02 Thread Jabbar Azam
Hello Sri,

As far as I know you can, if name and age are part of your partition key and
timestamp is the clustering key, e.g.

create table columnfamily (
name varchar,
age varchar,
tstamp timestamp,
   primary key ((name, age), tstamp)
);
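
With that layout the original multi-condition query becomes a single-partition
lookup plus a clustering-column restriction, along the lines of this sketch
(the timestamp literal is illustrative):

select * from columnfamily
 where name = 'foo' and age = '21' and tstamp = '2013-05-02 12:00:00';

Equality on name and age selects the partition, and tstamp can also be
restricted with range operators (>, >=, <, <=) because it is the clustering key.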




Thanks

Jabbar Azam


On 2 May 2013 11:45, Sri Ramya ramya.1...@gmail.com wrote:

 hi

 Can somebody tell me, is it possible to do a query with multiple conditions on cassandra
 like: Select * from columnfamily where name='foo' and age='21' and
 timestamp='unixtimestamp';

 Please give me some guidance for these kinds of queries

   Thank you



Re: cql query

2013-05-02 Thread Sri Ramya
Thank you very much. I will try and let you know whether it's working or not.


On Thu, May 2, 2013 at 7:04 PM, Jabbar Azam aja...@gmail.com wrote:

 Hello Sri,

 As far as I know you can, if name and age are part of your partition key
 and timestamp is the clustering key, e.g.

 create table columnfamily (
 name varchar,
 age varchar,
 tstamp timestamp,
    primary key ((name, age), tstamp)
 );




 Thanks

 Jabbar Azam


 On 2 May 2013 11:45, Sri Ramya ramya.1...@gmail.com wrote:

 hi

 Can somebody tell me, is it possible to do a query with multiple conditions on cassandra
 like: Select * from columnfamily where name='foo' and age='21' and
 timestamp='unixtimestamp';

 Please give me some guidance for these kinds of queries

   Thank you





Re: Anyway To Query Just The Partition Key?

2013-04-22 Thread Sylvain Lebresne
What you want is https://issues.apache.org/jira/browse/CASSANDRA-4536 I
believe.
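
For context, that ticket covers selecting just the partition key columns (via
SELECT DISTINCT); once available, the query would look roughly like the sketch
below (the table name is illustrative, since the original statement omitted one):

SELECT DISTINCT surname, city, country FROM event_data;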


On Sat, Apr 13, 2013 at 8:16 PM, Gareth Collins
gareth.o.coll...@gmail.comwrote:

 Edward,

 Thanks for the response. This is what I thought. The only reason why I am
 doing it like this is that I don't know these partition keys in advance
 (otherwise I would design this differently). So when I need to insert data,
 it looks like I need to insert to both the data table and the table
 containing the partition keys. Good thing writes in Cassandra are
 idempotent...:)

 thanks again,
 Gareth


 On Sat, Apr 13, 2013 at 7:26 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 You can 'list' or 'select *' the column family and you get them in a
 pseudo random order. When you say subset it implies you might want a
 specific range which is something this schema can not do.




 On Sat, Apr 13, 2013 at 2:05 AM, Gareth Collins 
 gareth.o.coll...@gmail.com wrote:

 Hello,

 If I have a cql3 table like this (I don't have a table with this data -
 this is just for example):

 create table (
 surname text,
 city text,
 country text,
 event_id timeuuid,
 data text,
 PRIMARY KEY ((surname, city, country),event_id));

 there is no way of (easily) getting the set (or a subset) of partition
 keys, is there (i.e. surname/city/country)? If I want easy access to do
 queries to get a subset of the partition keys, I have to create another
 table?

 I am assuming yes but just making sure I am not missing something
 obvious here.

 thanks in advance,
 Gareth






Anyway To Query Just The Partition Key?

2013-04-13 Thread Gareth Collins
Hello,

If I have a cql3 table like this (I don't have a table with this data -
this is just for example):

create table (
surname text,
city text,
country text,
event_id timeuuid,
data text,
PRIMARY KEY ((surname, city, country),event_id));

there is no way of (easily) getting the set (or a subset) of partition
keys, is there (i.e. surname/city/country)? If I want easy access to do
queries to get a subset of the partition keys, I have to create another
table?

I am assuming yes but just making sure I am not missing something obvious
here.

thanks in advance,
Gareth


Re: Anyway To Query Just The Partition Key?

2013-04-13 Thread Jabbar Azam
With your example you can do an equality search with surname and city, and
then use IN with country.

Eg.  Select * from yourtable where surname='blah' and city='blah blah' and
country in ('country1', 'country2')

Hope that helps

Jabbar Azam
On 13 Apr 2013 07:06, Gareth Collins gareth.o.coll...@gmail.com wrote:

 Hello,

 If I have a cql3 table like this (I don't have a table with this data -
 this is just for example):

 create table (
 surname text,
 city text,
 country text,
 event_id timeuuid,
 data text,
 PRIMARY KEY ((surname, city, country),event_id));

 there is no way of (easily) getting the set (or a subset) of partition
 keys, is there (i.e. surname/city/country)? If I want easy access to do
 queries to get a subset of the partition keys, I have to create another
 table?

 I am assuming yes but just making sure I am not missing something obvious
 here.

 thanks in advance,
 Gareth



Re: Anyway To Query Just The Partition Key?

2013-04-13 Thread Edward Capriolo
You can 'list' or 'select *' the column family and you get them in a pseudo
random order. When you say subset it implies you might want a specific
range which is something this schema can not do.




On Sat, Apr 13, 2013 at 2:05 AM, Gareth Collins
gareth.o.coll...@gmail.comwrote:

 Hello,

 If I have a cql3 table like this (I don't have a table with this data -
 this is just for example):

 create table (
 surname text,
 city text,
 country text,
 event_id timeuuid,
 data text,
 PRIMARY KEY ((surname, city, country),event_id));

 there is no way of (easily) getting the set (or a subset) of partition
 keys, is there (i.e. surname/city/country)? If I want easy access to do
 queries to get a subset of the partition keys, I have to create another
 table?

 I am assuming yes but just making sure I am not missing something obvious
 here.

 thanks in advance,
 Gareth



Re: Anyway To Query Just The Partition Key?

2013-04-13 Thread Gareth Collins
Thank you for the answer.

My apologies. I should have been clearer with my question.

Say, for example, I have 1000 partition keys and 10,000 rows per partition
key. I am trying to avoid bringing back 10 million rows to find the 1000
partition keys. I assume I cannot avoid bringing back the 10 million rows
(or at least an order of magnitude more than 1000 rows) without having
another table?

thanks,
Gareth


On Sat, Apr 13, 2013 at 4:13 AM, Jabbar Azam aja...@gmail.com wrote:

 With your example you can do an equality search with surname and city, and
 then use IN with country.

 Eg.  Select * from yourtable where surname='blah' and city='blah blah' and
 country in ('country1', 'country2')

 Hope that helps

 Jabbar Azam
 On 13 Apr 2013 07:06, Gareth Collins gareth.o.coll...@gmail.com wrote:

 Hello,

 If I have a cql3 table like this (I don't have a table with this data -
 this is just for example):

 create table (
 surname text,
 city text,
 country text,
 event_id timeuuid,
 data text,
 PRIMARY KEY ((surname, city, country),event_id));

 there is no way of (easily) getting the set (or a subset) of partition
 keys, is there (i.e. surname/city/country)? If I want easy access to do
 queries to get a subset of the partition keys, I have to create another
 table?

 I am assuming yes but just making sure I am not missing something obvious
 here.

 thanks in advance,
 Gareth




Re: Anyway To Query Just The Partition Key?

2013-04-13 Thread Gareth Collins
Edward,

Thanks for the response. This is what I thought. The only reason why I am
doing it like this is that I don't know these partition keys in advance
(otherwise I would design this differently). So when I need to insert data,
it looks like I need to insert to both the data table and the table
containing the partition keys. Good thing writes in Cassandra are
idempotent...:)
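
A minimal sketch of that two-table layout (the lookup table name and the
literals are illustrative):

-- data table, keyed by the composite partition key
CREATE TABLE event_data (
    surname text,
    city text,
    country text,
    event_id timeuuid,
    data text,
    PRIMARY KEY ((surname, city, country), event_id)
);

-- lookup table holding just the known partition keys
CREATE TABLE known_partitions (
    surname text,
    city text,
    country text,
    PRIMARY KEY (surname, city, country)
);

-- every insert writes to both tables
INSERT INTO event_data (surname, city, country, event_id, data)
  VALUES ('smith', 'london', 'uk', now(), 'payload');
INSERT INTO known_partitions (surname, city, country)
  VALUES ('smith', 'london', 'uk');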

thanks again,
Gareth


On Sat, Apr 13, 2013 at 7:26 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 You can 'list' or 'select *' the column family and you get them in a
 pseudo random order. When you say subset it implies you might want a
 specific range which is something this schema can not do.




 On Sat, Apr 13, 2013 at 2:05 AM, Gareth Collins 
 gareth.o.coll...@gmail.com wrote:

 Hello,

 If I have a cql3 table like this (I don't have a table with this data -
 this is just for example):

 create table (
 surname text,
 city text,
 country text,
 event_id timeuuid,
 data text,
 PRIMARY KEY ((surname, city, country),event_id));

 there is no way of (easily) getting the set (or a subset) of partition
 keys, is there (i.e. surname/city/country)? If I want easy access to do
 queries to get a subset of the partition keys, I have to create another
 table?

 I am assuming yes but just making sure I am not missing something obvious
 here.

 thanks in advance,
 Gareth





Re: Getting NullPointerException while executing query

2013-04-11 Thread Kuldeep Mishra
I am using cassandra 1.2.0,


Thanks
Kuldeep


On Wed, Apr 10, 2013 at 10:40 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On which version of Cassandra are you? I can't reproduce the
 NullPointerException on Cassandra 1.2.3.

 That being said, that query is not valid, so you will get an error
 message. There are 2 reasons why it's not valid:
   1) in token(deep), deep is not a valid term. So you should have
 something like: token('deep').
   2) the name column is not the partition key, so the token method cannot
 be applied to it.

 A valid query with that schema would be for instance:
   select * from CQLUSER where token(id) > token(4)
 though I don't know if that helps in any way with what you aimed to do.

 --
 Sylvain


 On Wed, Apr 10, 2013 at 9:42 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:

 Hi ,
  TABLE -
 CREATE TABLE CQLUSER (
   id int PRIMARY KEY,
   age int,
   name text
 )
 Query -
   select * from CQLUSER where token(name) > token(deep);

 ERROR -
 Bad Request: Failed parsing statement: [select * from CQLUSER where
 token(name) > token(deep);] reason: NullPointerException null
 text could not be lexed at line 1, char 15

 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Re: describe keyspace or column family query not working

2013-04-11 Thread aaron morton
tables created without COMPACT STORAGE are still visible in cassandra-cli.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/04/2013, at 5:40 AM, Tyler Hobbs ty...@datastax.com wrote:

 
 On Wed, Apr 10, 2013 at 11:09 AM, Vivek Mishra mishra.v...@gmail.com wrote:
 Ok. A column family and keyspace created via cqlsh using cql3 is visible via 
 cassandra-cli or thrift API?
 
 The column family will only be visible via cassandra-cli and the Thrift API 
 if it was created WITH COMPACT STORAGE: 
 http://www.datastax.com/docs/1.2/cql_cli/cql/CREATE_TABLE#using-compact-storage
 
 
 -- 
 Tyler Hobbs
 DataStax



describe keyspace or column family query not working

2013-04-10 Thread Kuldeep Mishra
Hi ,
I am trying to execute the following queries, but they are not working and throw an
exception.

QUERY:--
 Cassandra.Client client;
 client.execute_cql3_query(ByteBuffer.wrap("describe keyspace
mykeyspace".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
ConsistencyLevel.ONE);

 client.execute_cql3_query(ByteBuffer.wrap("describe table
mytable".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
ConsistencyLevel.ONE);

but both queries give the following exception:

STACK TRACE

InvalidRequestException(why:line 1:0 no viable alternative at input
'describe')
at
org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at
org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
at
org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)

Please help..


Thanks and Regards
Kuldeep






-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Getting NullPointerException while executing query

2013-04-10 Thread Kuldeep Mishra
Hi ,
 TABLE -
CREATE TABLE CQLUSER (
  id int PRIMARY KEY,
  age int,
  name text
)
Query -
  select * from CQLUSER where token(name) > token(deep);

ERROR -
Bad Request: Failed parsing statement: [select * from CQLUSER where
token(name) > token(deep);] reason: NullPointerException null
text could not be lexed at line 1, char 15

-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Re: describe keyspace or column family query not working

2013-04-10 Thread Tyler Hobbs
DESCRIBE is a cqlsh feature, not a part of the CQL language.
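
If schema information is needed programmatically, one alternative (a sketch
assuming Cassandra 1.2's system tables) is to query the schema tables over
CQL 3 instead:

SELECT * FROM system.schema_columnfamilies
 WHERE keyspace_name = 'mykeyspace';

SELECT * FROM system.schema_columns
 WHERE keyspace_name = 'mykeyspace' AND columnfamily_name = 'mytable';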


On Wed, Apr 10, 2013 at 2:37 AM, Kuldeep Mishra kuld.cs.mis...@gmail.comwrote:

 Hi ,
 I am trying to execute the following queries, but they are not working and throw an
 exception.

 QUERY:--
  Cassandra.Client client;
  client.execute_cql3_query(ByteBuffer.wrap("describe keyspace
 mykeyspace".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
 ConsistencyLevel.ONE);

  client.execute_cql3_query(ByteBuffer.wrap("describe table
 mytable".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
 ConsistencyLevel.ONE);

 but both queries give the following exception:

 STACK TRACE

 InvalidRequestException(why:line 1:0 no viable alternative at input
 'describe')
 at
 org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
 at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)

 Please help..


 Thanks and Regards
 Kuldeep






 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: describe keyspace or column family query not working

2013-04-10 Thread Vivek Mishra
Ok. A column family and keyspace created via cqlsh using cql3 is visible
via cassandra-cli or thrift API?

-Vivek


On Wed, Apr 10, 2013 at 9:23 PM, Tyler Hobbs ty...@datastax.com wrote:

 DESCRIBE is a cqlsh feature, not a part of the CQL language.


 On Wed, Apr 10, 2013 at 2:37 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:

 Hi ,
 I am trying to execute the following queries, but they are not working and throw an
 exception.

 QUERY:--
  Cassandra.Client client;
  client.execute_cql3_query(ByteBuffer.wrap("describe keyspace
 mykeyspace".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
 ConsistencyLevel.ONE);

  client.execute_cql3_query(ByteBuffer.wrap("describe table
 mytable".getBytes(Constants.CHARSET_UTF8)), Compression.NONE,
 ConsistencyLevel.ONE);

 but both queries give the following exception:

 STACK TRACE

 InvalidRequestException(why:line 1:0 no viable alternative at input
 'describe')
 at
 org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
 at
 org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
 at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)

 Please help..


 Thanks and Regards
 Kuldeep






 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Tyler Hobbs
 DataStax http://datastax.com/



Re: Getting NullPointerException while executing query

2013-04-10 Thread Sylvain Lebresne
On which version of Cassandra are you? I can't reproduce the
NullPointerException on Cassandra 1.2.3.

That being said, that query is not valid, so you will get an error message.
There are 2 reasons why it's not valid:
  1) in token(deep), deep is not a valid term. So you should have something
like: token('deep').
  2) the name column is not the partition key, so the token method cannot be
applied to it.

A valid query with that schema would be for instance:
  select * from CQLUSER where token(id) > token(4)
though I don't know if that helps in any way with what you aimed to do.

--
Sylvain
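
If the goal is to look rows up by name rather than to page through tokens, a
secondary index is one option; a minimal sketch (the index name is illustrative):

CREATE INDEX cqluser_name_idx ON CQLUSER (name);

SELECT * FROM CQLUSER WHERE name = 'deep';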


On Wed, Apr 10, 2013 at 9:42 AM, Kuldeep Mishra kuld.cs.mis...@gmail.comwrote:

 Hi ,
  TABLE -
 CREATE TABLE CQLUSER (
   id int PRIMARY KEY,
   age int,
   name text
 )
 Query -
   select * from CQLUSER where token(name) > token(deep);

 ERROR -
 Bad Request: Failed parsing statement: [select * from CQLUSER where
 token(name) > token(deep);] reason: NullPointerException null
 text could not be lexed at line 1, char 15

 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199



Re: describe keyspace or column family query not working

2013-04-10 Thread Tyler Hobbs
On Wed, Apr 10, 2013 at 11:09 AM, Vivek Mishra mishra.v...@gmail.comwrote:

 Ok. A column family and keyspace created via cqlsh using cql3 is visible
 via cassandra-cli or thrift API?


The column family will only be visible via cassandra-cli and the Thrift API
if it was created WITH COMPACT STORAGE:
http://www.datastax.com/docs/1.2/cql_cli/cql/CREATE_TABLE#using-compact-storage
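
A minimal sketch of such a definition (table and column names are illustrative):

CREATE TABLE users (
    user_id text PRIMARY KEY,
    user_name text,
    email text
) WITH COMPACT STORAGE;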


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Counter batches query

2013-04-08 Thread aaron morton
For #1, Storage Proxy (server-wide) metrics are per request, so 1 in your
example. CF-level metrics are per row, so 5 in your example.

Not sure what graph you were looking at in OpsCenter; probably best to ask
here: http://www.datastax.com/support-forums/

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/04/2013, at 2:30 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 For #2
 There are two mutates in thrift: batch_mutate and atomic_batch_mutate. The 
 atomic version was just added. If you care more about performance, do not 
 use the atomic version.
 
 
 On Sat, Apr 6, 2013 at 12:03 AM, Matt K infinitelimittes...@gmail.com wrote:
 Hi,
 
 I have an application that does batch (counter) writes to multiple CFs. The 
 application itself is multi-threaded and I'm using C* 1.2.2 and Astyanax 
 driver. Could someone share insights on:
 
 1) When I see the cluster write throughput graph in opscenter, the number is 
 not reflective of the actual number of writes. For example: if I issue a single 
 batch write (internally having 5 mutations), is the opscenter/JMX cluster/node 
 write count supposed to indicate 1 or 5? (I would assume 5) 
 
 2) I read that from C* 1.2.x there are atomic counter batches which can cause 
 a 30% performance hit - wondering if this is applicable to existing thrift based 
 clients like Astyanax/Hector and if so, what is the way to turn it off? Any 
 server side settings too?
 
 Thanks!
 



Re: Counter batches query

2013-04-06 Thread Edward Capriolo
For #2
There are two mutates in thrift: batch_mutate and atomic_batch_mutate. The
atomic version was just added. If you care more about performance, do not
use the atomic version.
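
On the CQL 3 side, the non-atomic path for counters is a COUNTER batch; a
minimal sketch (table and column names are illustrative):

BEGIN COUNTER BATCH
  UPDATE page_views SET hits = hits + 1 WHERE page = 'home';
  UPDATE site_views SET hits = hits + 1 WHERE site = 'example.com';
APPLY BATCH;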


On Sat, Apr 6, 2013 at 12:03 AM, Matt K infinitelimittes...@gmail.comwrote:

 Hi,

 I have an application that does batch (counter) writes to multiple CFs.
 The application itself is multi-threaded and I'm using C* 1.2.2 and
 Astyanax driver. Could someone share insights on:

 1) When I see the cluster write throughput graph in opscenter, the number
 is not reflective of the actual number of writes. For example: if I issue a
 single batch write (internally having 5 mutations), is the opscenter/JMX
 cluster/node write count supposed to indicate 1 or 5? (I would assume 5)

 2) I read that from C* 1.2.x there are atomic counter batches which can
 cause a 30% performance hit - wondering if this is applicable to existing thrift
 based clients like Astyanax/Hector and if so, what is the way to turn it
 off? Any server side settings too?

 Thanks!



Re: Data Model and Query

2013-04-05 Thread aaron morton
 What's the recommendation on querying a data model like StartDate > “X” and 
 counter > “Y”?
 
 
it's not possible. 

If you are using secondary indexes you have to have an equals clause in the 
statement. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/04/2013, at 6:53 AM, shubham srivastava shubha...@gmail.com wrote:

 Hi,
 
  
 What's the recommendation on querying a data model like StartDate > “X” and 
 counter > “Y”? It's kind of a range query across multiple columns and the key.
 
 I have the flexibility for modelling the data for the above query accordingly.
 
 
  
 Regards,
 
 Shubham
 



Re: Data Model and Query

2013-04-05 Thread Hiller, Dean
I would partition either with cassandra's partitioning or PlayOrm partitioning 
and query like so

Where beginOfMonth=X and startDate > X and counter > Y.  This only
returns stuff after X in that partition though, so you may need to run multiple
queries like this, and if you have billions of rows it could take some
time…. Instead you may want startDate > X and startDate < Z such that Z and X
are in the same month, or if they span 2-3 partitions, then just run the 2-3
queries.  I don't know enough detail on your use case to know if this works for
you though.
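
A rough CQL sketch of that layout (table and column names and the literals are
illustrative):

CREATE TABLE events (
    begin_of_month timestamp,
    start_date timestamp,
    counter_value bigint,
    payload text,
    PRIMARY KEY (begin_of_month, start_date)
);

-- one month-partition at a time; the counter_value condition still has to be
-- applied client side, since only the clustering column supports ranges here
SELECT * FROM events
 WHERE begin_of_month = '2013-04-01'
   AND start_date > '2013-04-05'
   AND start_date < '2013-04-20';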

Dean

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Friday, April 5, 2013 10:59 AM
To: user@cassandra.apache.org
Subject: Re: Data Model and Query


What's the recommendation on querying a data model like StartDate > “X” and
counter > “Y”?

it's not possible.

If you are using secondary indexes you have to have an equals clause in the 
statement.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/04/2013, at 6:53 AM, shubham srivastava shubha...@gmail.com wrote:


Hi,



What's the recommendation on querying a data model like StartDate > “X” and
counter > “Y”? It's kind of a range query across multiple columns and the key.

I have the flexibility for modelling the data for the above query accordingly.



Regards,

Shubham



Counter batches query

2013-04-05 Thread Matt K
Hi,

I have an application that does batch (counter) writes to multiple CFs. The
application itself is multi-threaded and I'm using C* 1.2.2 and Astyanax
driver. Could someone share insights on:

1) When I see the cluster write throughput graph in opscenter, the number
is not reflective of the actual number of writes. For example: if I issue a
single batch write (internally having 5 mutations), is the opscenter/JMX
cluster/node write count supposed to indicate 1 or 5? (I would assume 5)

2) I read that from C* 1.2.x there are atomic counter batches which can
cause a 30% performance hit - wondering if this is applicable to existing thrift
based clients like Astyanax/Hector and if so, what is the way to turn it
off? Any server side settings too?

Thanks!


Re: Unable to prefix in astyanax read query

2013-04-03 Thread aaron morton
 I have created this column family using CQL and defined the primary key
 as 
What was the create table statement ? 

 BadRequestException: [host=localhost(127.0.0.1):9160, latency=6(6),
 attempts=1]InvalidRequestException(why:Not enough bytes to read value of
 component 0)
Unless the CQL 3 create table statement specifies WITH COMPACT STORAGE it will 
use composites under the hood, and Astyanax may not be expecting this. 

Unless astyanax specifically says it can write to CQL 3 tables it's best to 
only access them using CQL 3. 
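
For illustration, a minimal CQL 3 version of that access path (table and column
names are illustrative, not the original schema):

CREATE TABLE device_data (
    ts timestamp,
    device_id text,
    device_name text,
    field_name text,
    value text,
    PRIMARY KEY (ts, device_id, device_name, field_name)
);

-- all entries for one timestamp/device combination
SELECT * FROM device_data
 WHERE ts = '2013-04-01 00:00:00'
   AND device_id = 'dev-42'
   AND device_name = 'sensor';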

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 2/04/2013, at 6:07 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 We ran into some similar errors in playorm development.  Basically, you
 defined a composite probably but are not correctly using that composite.
 I am not sure about queries though as we had the issue when saving data
 (ie. Using deviceID+deviceName did not work and we had to create a
 full-blown composite object).  I think you need to read up on how astyanax
 works with composites… I am not sure this is a cassandra question
 really… more of an astyanax one.
 
 Dean
 
 On 4/1/13 11:48 PM, Apurva Jalit apurva.ja...@gmail.com wrote:
 
 I have a schema as follows:
 
 TimeStamp
 Device ID
 Device Name
 Device Owner
 Device location
 
 I have created this column family using CQL and defined the primary key
 as 
 (TimeStamp,Device ID, Device Name). Through a serializable object that
 has 
 fields for DeviceID, name and a field name (which stores either Device
 Owner or 
 Device Location). I have inserted some records using Astyanax.
 
 As per my understanding, the columns for a row are created by combining
 Device 
 ID, Device Name and field name as column name and the value to be the
 value for 
 that particular field. Thus for a particular timestamp and device, the
 column 
 names would be in the pattern (Device ID:Device Name: ...).
 
 So I believe we can use these 2 fields as prefix to obtain all the
 entries for a 
 particular time-device combination.
 
 I am using the following query to obtain the results:
 
 RowSliceQuery<String, ApBaseData> query = adu.keyspace
 .prepareQuery(columnFamily)
 .getKeySlice(timeStamp)
 .withColumnRange(new RangeBuilder()
  .setStart(deviceID+deviceName+_\u0)
  .setEnd(deviceID+deviceName+_\u)
  .setLimit(batch_size)
  .build());
 
 But on executing the above query I get the following Exception:
 
 BadRequestException: [host=localhost(127.0.0.1):9160, latency=6(6),
 attempts=1]InvalidRequestException(why:Not enough bytes to read value of
 component 0)
 
 Can anyone help me understand where I am going wrong?
 
 
 



Data Model and Query

2013-04-03 Thread shubham srivastava
Hi,



What's the recommendation on querying a data model like StartDate > “X” and
counter > “Y”? It's kind of a range query across multiple columns and the key.

I have the flexibility for modelling the data for the above
query accordingly.



Regards,

Shubham


Re: Unable to prefix in astyanax read query

2013-04-02 Thread Hiller, Dean
We ran into some similar errors in playorm development.  Basically, you
defined a composite probably but are not correctly using that composite.
I am not sure about queries though as we had the issue when saving data
(ie. Using deviceID+deviceName did not work and we had to create a
full-blown composite object).  I think you need to read up on how astyanax
works with composites… I am not sure this is a cassandra question
really… more of an astyanax one.

Dean

On 4/1/13 11:48 PM, Apurva Jalit apurva.ja...@gmail.com wrote:

I have a schema as follows:

 TimeStamp
 Device ID
 Device Name
 Device Owner
 Device location

I have created this column family using CQL and defined the primary key
as 
(TimeStamp,Device ID, Device Name). Through a serializable object that
has 
fields for DeviceID, name and a field name (which stores either Device
Owner or 
Device Location). I have inserted some records using Astyanax.

As per my understanding, the columns for a row are created by combining
Device 
ID, Device Name and field name as column name and the value to be the
value for 
that particular field. Thus for a particular timestamp and device, the
column 
names would be in the pattern (Device ID:Device Name: ...).

So I believe we can use these 2 fields as prefix to obtain all the
entries for a 
particular time-device combination.

I am using the following query to obtain the results:

  RowSliceQuery<String, ApBaseData> query = adu.keyspace
  .prepareQuery(columnFamily)
  .getKeySlice(timeStamp)
  .withColumnRange(new RangeBuilder()
   .setStart(deviceID+deviceName+_\u0)
   .setEnd(deviceID+deviceName+_\u)
   .setLimit(batch_size)
   .build());

But on executing the above query I get the following Exception:

BadRequestException: [host=localhost(127.0.0.1):9160, latency=6(6),
attempts=1]InvalidRequestException(why:Not enough bytes to read value of
component 0)

Can anyone help me understand where I am going wrong?





Unable to prefix in astyanax read query

2013-04-01 Thread Apurva Jalit
I have a schema as follows:

 TimeStamp
 Device ID
 Device Name
 Device Owner
 Device location

I have created this column family using CQL and defined the primary key as 
(TimeStamp,Device ID, Device Name). Through a serializable object that has 
fields for DeviceID, name and a field name (which stores either Device Owner or 
Device Location). I have inserted some records using Astyanax.

As per my understanding, the columns for a row are created by combining Device 
ID, Device Name and field name as column name and the value to be the value for 
that particular field. Thus for a particular timestamp and device, the column 
names would be in the pattern (Device ID:Device Name: ...).

So I believe we can use these 2 fields as prefix to obtain all the entries for 
a 
particular time-device combination.

I am using the following query to obtain the results:

  RowSliceQuery<String, ApBaseData> query = adu.keyspace
  .prepareQuery(columnFamily)
  .getKeySlice(timeStamp)
  .withColumnRange(new RangeBuilder()
   .setStart(deviceID+deviceName+_\u0)
   .setEnd(deviceID+deviceName+_\u)
   .setLimit(batch_size)
   .build());

But on executing the above query I get the following Exception:

BadRequestException: [host=localhost(127.0.0.1):9160, latency=6(6), 
attempts=1]InvalidRequestException(why:Not enough bytes to read value of 
component 0)

Can anyone help me understand where I am going wrong?




Re: Digest Query Seems to be corrupt on certain cases

2013-03-31 Thread aaron morton
 When I manually inspected this byte array, it seems to hold all details 
 correctly, except the super-column name, causing it to fetch the entire wide 
 row.
What is the CF definition and what is the exact query you are sending? 
There does not appear to be anything obvious in the QueryPath serde for 1.0.7

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 10:54 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 VM Settings are
 -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities 
 -XX:ThreadPriorityPolicy=42 -Xms8G -Xmx8G -Xmn800M 
 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
 -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 
 The error stack contained 2 threads for the same key, stalling on the digest 
 query.
 
 The bytes below, which I referred to, are the actual value of the _body variable 
 in the org.apache.cassandra.net.Message object obtained from the heap dump.
 
 As I understand from the code, ReadVerbHandler will deserialize this _body 
 variable into a SliceByNamesReadCommand object.
 
  When I manually inspected this byte array, it seems to hold all details 
 correctly, except the super-column name, causing it to fetch the entire wide 
 row.
 
 --
 Ravi
 
 On Thu, Mar 28, 2013 at 8:36 AM, aaron morton aa...@thelastpickle.com wrote:
 We started receiving OOMs in our cassandra grid and took a heap dump
 What are the JVM settings ? 
 What was the error stack? 
 
 I am pasting the serialized byte array of SliceByNamesReadCommand, which 
 seems to be corrupt on issuing certain digest queries.
 
 Sorry I don't follow what you are saying here. 
  Can you enable DEBUG logging and identify the behaviour you think is 
  incorrect?
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 28/03/2013, at 4:15 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:
 
 We started receiving OOMs in our cassandra grid and took a heap dump. We are 
 running version 1.0.7 with LOCAL_QUORUM from both reads/writes.
 
 After some analysis, we kind of identified the problem, with 
 SliceByNamesReadCommand, involving a single Super-Column. This seems to be 
 happening only in digest query and not during actual reads.
 
 I am pasting the serialized byte array of SliceByNamesReadCommand, which 
 seems to be corrupt on issuing certain digest queries.
 
  //Type is SliceByNamesReadCommand
  body[0] = (byte)1;
  
  //This is a digest query here.
  body[1] = (byte)1;
 
 //Table-Name from 2-8 bytes
 
 //Key-Name from 9-18 bytes
 
 //QueryPath deserialization here
  
  //CF-Name from 19-30 bytes
 
 //Super-Col-Name from 31st byte onwards, but gets 
 corrupt as found in heap dump
 
 //body[32-37] = 0, body[38] = 1, body[39] = 0.  This 
 causes the SliceByNamesDeserializer to mark both ColName=NULL and 
 SuperColName=NULL, fetching entire wide-row!!!
 
//Actual super-col-name starts only from byte 40, whereas 
 it should have started from 31st byte itself
 
 Has someone already encountered such an issue? Why is the super-col-name not 
 correctly de-serialized during digest query.
 
 --
 Ravi
 
 
 



Digest Query Seems to be corrupt on certain cases

2013-03-27 Thread Ravikumar Govindarajan
We started receiving OOMs in our cassandra grid and took a heap dump. We
are running version 1.0.7 with LOCAL_QUORUM from both reads/writes.

After some analysis, we kind of identified the problem, with
SliceByNamesReadCommand, involving a single Super-Column. This seems to be
happening only in digest query and not during actual reads.

I am pasting the serialized byte array of SliceByNamesReadCommand, which
seems to be corrupt on issuing certain digest queries.

//Type is SliceByNamesReadCommand
body[0] = (byte)1;
 //This is a digest query here.
body[1] = (byte)1;

//Table-Name from 2-8 bytes

//Key-Name from 9-18 bytes

//QueryPath deserialization here

 //CF-Name from 19-30 bytes

//Super-Col-Name from 31st byte onwards, but gets
corrupt as found in heap dump

//body[32-37] = 0, body[38] = 1, body[39] = 0.  This
causes the SliceByNamesDeserializer to mark both ColName=NULL and
SuperColName=NULL, fetching entire wide-row!!!

   //Actual super-col-name starts only from byte 40,
whereas it should have started from 31st byte itself

Has someone already encountered such an issue? Why is the super-col-name
not correctly de-serialized during digest query.

--
Ravi


Re: Digest Query Seems to be corrupt on certain cases

2013-03-27 Thread aaron morton
 We started receiving OOMs in our cassandra grid and took a heap dump
What are the JVM settings ? 
What was the error stack? 

 I am pasting the serialized byte array of SliceByNamesReadCommand, which 
 seems to be corrupt on issuing certain digest queries.
Sorry I don't follow what you are saying here. 
Can you enable DEBUG logging and identify the behaviour you think is
incorrect?

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 4:15 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 We started receiving OOMs in our cassandra grid and took a heap dump. We are 
 running version 1.0.7 with LOCAL_QUORUM from both reads/writes.
 
 After some analysis, we kind of identified the problem, with 
 SliceByNamesReadCommand, involving a single Super-Column. This seems to be 
 happening only in digest query and not during actual reads.
 
 I am pasting the serialized byte array of SliceByNamesReadCommand, which 
 seems to be corrupt on issuing certain digest queries.
 
   //Type is SliceByNamesReadCommand
   body[0] = (byte)1;
   
   //This is a digest query here.
   body[1] = (byte)1;
 
 //Table-Name from 2-8 bytes
 
 //Key-Name from 9-18 bytes
 
 //QueryPath deserialization here
  
  //CF-Name from 19-30 bytes
 
 //Super-Col-Name from 31st byte onwards, but gets corrupt 
 as found in heap dump
 
 //body[32-37] = 0, body[38] = 1, body[39] = 0.  This 
 causes the SliceByNamesDeserializer to mark both ColName=NULL and 
 SuperColName=NULL, fetching entire wide-row!!!
 
//Actual super-col-name starts only from byte 40, whereas 
 it should have started from 31st byte itself
 
 Has someone already encountered such an issue? Why is the super-col-name not 
 correctly de-serialized during digest query.
 
 --
 Ravi
 



Re: Digest Query Seems to be corrupt on certain cases

2013-03-27 Thread Ravikumar Govindarajan
VM Settings are
-javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms8G -Xmx8G -Xmn800M
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

The error stack contained 2 threads for the same key, stalling on the digest
query.

The bytes below, which I referred to, are the actual value of the _body
variable in the org.apache.cassandra.net.Message object obtained from the heap dump.

As I understand from the code, ReadVerbHandler will deserialize this
_body variable into a SliceByNamesReadCommand object.

When I manually inspected this byte array, it seems to hold all details
correctly, except the super-column name, causing it to fetch the entire
wide row.

--
Ravi

On Thu, Mar 28, 2013 at 8:36 AM, aaron morton aa...@thelastpickle.comwrote:

 We started receiving OOMs in our cassandra grid and took a heap dump

 What are the JVM settings ?
 What was the error stack?

 I am pasting the serialized byte array of SliceByNamesReadCommand, which
 seems to be corrupt on issuing certain digest queries.

 Sorry I don't follow what you are saying here.
  Can you enable DEBUG logging and identify the behaviour you think
  is incorrect?

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 28/03/2013, at 4:15 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 We started receiving OOMs in our cassandra grid and took a heap dump. We
 are running version 1.0.7 with LOCAL_QUORUM from both reads/writes.

 After some analysis, we kind of identified the problem, with
 SliceByNamesReadCommand, involving a single Super-Column. This seems to be
 happening only in digest query and not during actual reads.

 I am pasting the serialized byte array of SliceByNamesReadCommand, which
 seems to be corrupt on issuing certain digest queries.

 //Type is SliceByNamesReadCommand
  body[0] = (byte)1;
  //This is a digest query here.
  body[1] = (byte)1;

 //Table-Name from 2-8 bytes

 //Key-Name from 9-18 bytes

 //QueryPath deserialization here

  //CF-Name from 19-30 bytes

 //Super-Col-Name from 31st byte onwards, but gets
 corrupt as found in heap dump

 //body[32-37] = 0, body[38] = 1, body[39] = 0.  This
 causes the SliceByNamesDeserializer to mark both ColName=NULL and
 SuperColName=NULL, fetching entire wide-row!!!

//Actual super-col-name starts only from byte 40,
 whereas it should have started from 31st byte itself

 Has someone already encountered such an issue? Why is the super-col-name
 not correctly de-serialized during digest query.

 --
 Ravi





Re: cql query not giving any result.

2013-03-18 Thread Sylvain Lebresne
CQL can't work correctly if 2 (CQL) columns have the same name. Now, to
allow upgrade from thrift, CQL does use some default names like key for
the Row key when there isn't anything else.

Honestly I think the easiest workaround here is probably to disambiguate
things manually. Typically, you could update the column family definition
to set the key_alias (in CfDef) to some name that make sense for you. This
will end up being the name of the Row key for CQL. You may also try issuing a
RENAME from CQL to rename the row key, which should work. Typically
something like ALTER KunderaExamples RENAME key TO rowKey.

--
Sylvain



On Sat, Mar 16, 2013 at 4:39 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Any suggestions?
 -Vivek

 On Fri, Mar 15, 2013 at 5:20 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Ok. So it's a case  when, CQL returns rowkey value as key and there is
 also column present with name as key.

 Sounds like a bug?

 -Vivek


 On Fri, Mar 15, 2013 at 5:17 PM, Kuldeep Mishra kuld.cs.mis...@gmail.com
  wrote:

 Hi Sylvain,
   I created it using thrift client, here is column family creation
 script,

 Cassandra.Client client;
 CfDef user_Def = new CfDef();
 user_Def.name = "DOCTOR";
 user_Def.keyspace = "KunderaExamples";
 user_Def.setComparator_type("UTF8Type");
 user_Def.setDefault_validation_class("UTF8Type");
 user_Def.setKey_validation_class("UTF8Type");
 ColumnDef key = new ColumnDef(ByteBuffer.wrap("KEY".getBytes()),
 "UTF8Type");
 key.index_type = IndexType.KEYS;
 ColumnDef age = new ColumnDef(ByteBuffer.wrap("AGE".getBytes()),
 "UTF8Type");
 age.index_type = IndexType.KEYS;
 user_Def.addToColumn_metadata(key);
 user_Def.addToColumn_metadata(age);

 client.set_keyspace("KunderaExamples");
 client.system_add_column_family(user_Def);


 Thanks
 KK


 On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne 
 sylv...@datastax.comwrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?,
 like in this particular scenario I have two column with same name as 
 key,
 first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you
 created that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199






Re: cql query not giving any result.

2013-03-18 Thread Vivek Mishra
If this is the case, why can't we restrict key as a keyword so that it cannot be
used as a column name?

-Vivek

On Mon, Mar 18, 2013 at 2:37 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 CQL can't work correctly if 2 (CQL) columns have the same name. Now, to
 allow upgrade from thrift, CQL does use some default names like key for
 the Row key when there isn't anything else.

 Honestly I think the easiest workaround here is probably to disambiguate
 things manually. Typically, you could update the column family definition
 to set the key_alias (in CfDef) to some name that make sense for you. This
 will end up being the name of the Row key for CQL. You may also try issue a
 RENAME from CQL to rename the row key, which should work. Typically
 something like ALTER KunderaExamples RENAME key TO rowKey.

 --
 Sylvain



 On Sat, Mar 16, 2013 at 4:39 AM, Vivek Mishra mishra.v...@gmail.comwrote:

 Any suggestions?
 -Vivek

 On Fri, Mar 15, 2013 at 5:20 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Ok. So it's a case  when, CQL returns rowkey value as key and there is
 also column present with name as key.

 Sounds like a bug?

 -Vivek


 On Fri, Mar 15, 2013 at 5:17 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi Sylvain,
   I created it using thrift client, here is column family creation
 script,

 Cassandra.Client client;
 CfDef user_Def = new CfDef();
 user_Def.name = "DOCTOR";
 user_Def.keyspace = "KunderaExamples";
 user_Def.setComparator_type("UTF8Type");
 user_Def.setDefault_validation_class("UTF8Type");
 user_Def.setKey_validation_class("UTF8Type");
 ColumnDef key = new
 ColumnDef(ByteBuffer.wrap("KEY".getBytes()), "UTF8Type");
 key.index_type = IndexType.KEYS;
 ColumnDef age = new
 ColumnDef(ByteBuffer.wrap("AGE".getBytes()), "UTF8Type");
 age.index_type = IndexType.KEYS;
 user_Def.addToColumn_metadata(key);
 user_Def.addToColumn_metadata(age);

 client.set_keyspace("KunderaExamples");
 client.system_add_column_family(user_Def);


 Thanks
 KK


 On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne sylv...@datastax.com
  wrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?,
 like in this particular scenario I have two column with same name as 
 key,
 first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you
 created that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where
 key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199







Re: cql query not giving any result.

2013-03-18 Thread Sylvain Lebresne
 If this is the case, Why can't we restrict key as a keyword and not to
 be used as a column name?


This is only a problem when upgrading from thrift to CQL. Forbidding key
as a column name in thrift would be weird to say the least.
What could be done is that CQL could, when it picks the default name it
uses, pick one that is not used already. That's definitely possible
and please do open a JIRA ticket for that.

But at the end of the day, if you are going to use CQL, I highly suggest
picking meaningful names for your CQL columns, so you will want
to rename the default name that CQL picks for the row key initially.

--
Sylvain



 -Vivek


 On Mon, Mar 18, 2013 at 2:37 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 CQL can't work correctly if 2 (CQL) columns have the same name. Now, to
 allow upgrade from thrift, CQL does use some default names like key for
 the Row key when there isn't anything else.

 Honestly I think the easiest workaround here is probably to disambiguate
 things manually. Typically, you could update the column family definition
 to set the key_alias (in CfDef) to some name that make sense for you. This
 will end up being the name of the Row key for CQL. You may also try issue a
 RENAME from CQL to rename the row key, which should work. Typically
 something like ALTER KunderaExamples RENAME key TO rowKey.

 --
 Sylvain



 On Sat, Mar 16, 2013 at 4:39 AM, Vivek Mishra mishra.v...@gmail.comwrote:

 Any suggestions?
 -Vivek

 On Fri, Mar 15, 2013 at 5:20 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Ok. So it's a case  when, CQL returns rowkey value as key and there
 is also column present with name as key.

 Sounds like a bug?

 -Vivek


 On Fri, Mar 15, 2013 at 5:17 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi Sylvain,
   I created it using thrift client, here is column family creation
 script,

 Cassandra.Client client;
 CfDef user_Def = new CfDef();
 user_Def.name = "DOCTOR";
 user_Def.keyspace = "KunderaExamples";
 user_Def.setComparator_type("UTF8Type");
 user_Def.setDefault_validation_class("UTF8Type");
 user_Def.setKey_validation_class("UTF8Type");
 ColumnDef key = new
 ColumnDef(ByteBuffer.wrap("KEY".getBytes()), "UTF8Type");
 key.index_type = IndexType.KEYS;
 ColumnDef age = new
 ColumnDef(ByteBuffer.wrap("AGE".getBytes()), "UTF8Type");
 age.index_type = IndexType.KEYS;
 user_Def.addToColumn_metadata(key);
 user_Def.addToColumn_metadata(age);

 client.set_keyspace("KunderaExamples");
 client.system_add_column_family(user_Def);


 Thanks
 KK


 On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne 
 sylv...@datastax.com wrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name
 ?, like in this particular scenario I have two column with same name as
 key, first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you
 created that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where
 key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199








Re: cql query not giving any result.

2013-03-15 Thread Kuldeep Mishra
Hi,
Is it possible in Cassandra to have multiple columns with the same name? Like
in this particular scenario, I have two columns with the same name, key:
the first one is the rowkey and the second one is a column name.


Thanks and Regards
Kuldeep

On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra kuld.cs.mis...@gmail.comwrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Re: cql query not giving any result.

2013-03-15 Thread Jason Wee
Here is a list of keywords and whether or not the words are reserved. A
reserved keyword cannot be used as an identifier unless you enclose the
word in double quotation marks. Non-reserved keywords have a specific
meaning in certain context but can be used as an identifier outside this
context.

http://www.datastax.com/docs/1.2/cql_cli/cql_lexicon#cql-keywords
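
For example, a reserved or case-sensitive name can still be used by
double-quoting it; a sketch with illustrative table and column names:

SELECT * FROM "MyTable" WHERE "key" = 'kuldeep';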


On Fri, Mar 15, 2013 at 6:43 PM, Kuldeep Mishra kuld.cs.mis...@gmail.comwrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?, like
 in this particular scenario I have two column with same name as key,
 first one is rowkey and second on is column name .


 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199



Re: cql query not giving any result.

2013-03-15 Thread Sylvain Lebresne
On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra
kuld.cs.mis...@gmail.comwrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?, like
 in this particular scenario I have two column with same name as key,
 first one is rowkey and second on is column name .


No, it shouldn't be possible and that is your problem. How did you create
that table?

--
Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199



Re: cql query not giving any result.

2013-03-15 Thread Kuldeep Mishra
Hi Sylvain,
  I created it using thrift client, here is column family creation
script,

Cassandra.Client client;
CfDef user_Def = new CfDef();
user_Def.name = "DOCTOR";
user_Def.keyspace = "KunderaExamples";
user_Def.setComparator_type("UTF8Type");
user_Def.setDefault_validation_class("UTF8Type");
user_Def.setKey_validation_class("UTF8Type");
ColumnDef key = new ColumnDef(ByteBuffer.wrap("KEY".getBytes()),
"UTF8Type");
key.index_type = IndexType.KEYS;
ColumnDef age = new ColumnDef(ByteBuffer.wrap("AGE".getBytes()),
"UTF8Type");
age.index_type = IndexType.KEYS;
user_Def.addToColumn_metadata(key);
user_Def.addToColumn_metadata(age);

client.set_keyspace("KunderaExamples");
client.system_add_column_family(user_Def);


Thanks
KK

On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra kuld.cs.mis...@gmail.com
  wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?,
 like in this particular scenario I have two column with same name as key,
 first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you created
 that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra kuld.cs.mis...@gmail.com
  wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199


Re: cql query not giving any result.

2013-03-15 Thread Vivek Mishra
Ok. So it's a case when CQL returns the rowkey value as key and there is
also a column present with the name key.

Sounds like a bug?

-Vivek

On Fri, Mar 15, 2013 at 5:17 PM, Kuldeep Mishra kuld.cs.mis...@gmail.comwrote:

 Hi Sylvain,
   I created it using thrift client, here is column family creation
 script,

 Cassandra.Client client;
 CfDef user_Def = new CfDef();
 user_Def.name = "DOCTOR";
 user_Def.keyspace = "KunderaExamples";
 user_Def.setComparator_type("UTF8Type");
 user_Def.setDefault_validation_class("UTF8Type");
 user_Def.setKey_validation_class("UTF8Type");
 ColumnDef key = new ColumnDef(ByteBuffer.wrap("KEY".getBytes()),
 "UTF8Type");
 key.index_type = IndexType.KEYS;
 ColumnDef age = new ColumnDef(ByteBuffer.wrap("AGE".getBytes()),
 "UTF8Type");
 age.index_type = IndexType.KEYS;
 user_Def.addToColumn_metadata(key);
 user_Def.addToColumn_metadata(age);

 client.set_keyspace("KunderaExamples");
 client.system_add_column_family(user_Def);


 Thanks
 KK


 On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?,
 like in this particular scenario I have two column with same name as key,
 first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you
 created that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199



Re: cql query not giving any result.

2013-03-15 Thread Vivek Mishra
Any suggestions?
-Vivek

On Fri, Mar 15, 2013 at 5:20 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Ok. So it's a case  when, CQL returns rowkey value as key and there is
 also column present with name as key.

 Sounds like a bug?

 -Vivek


 On Fri, Mar 15, 2013 at 5:17 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.comwrote:

 Hi Sylvain,
   I created it using thrift client, here is column family creation
 script,

 Cassandra.Client client;
 CfDef user_Def = new CfDef();
 user_Def.name = "DOCTOR";
 user_Def.keyspace = "KunderaExamples";
 user_Def.setComparator_type("UTF8Type");
 user_Def.setDefault_validation_class("UTF8Type");
 user_Def.setKey_validation_class("UTF8Type");
 ColumnDef key = new ColumnDef(ByteBuffer.wrap("KEY".getBytes()),
 "UTF8Type");
 key.index_type = IndexType.KEYS;
 ColumnDef age = new ColumnDef(ByteBuffer.wrap("AGE".getBytes()),
 "UTF8Type");
 age.index_type = IndexType.KEYS;
 user_Def.addToColumn_metadata(key);
 user_Def.addToColumn_metadata(age);

 client.set_keyspace("KunderaExamples");
 client.system_add_column_family(user_Def);


 Thanks
 KK


 On Fri, Mar 15, 2013 at 4:24 PM, Sylvain Lebresne 
 sylv...@datastax.comwrote:

 On Fri, Mar 15, 2013 at 11:43 AM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:

 Hi,
 Is it possible in Cassandra to make multiple column with same name ?,
 like in this particular scenario I have two column with same name as key,
 first one is rowkey and second on is column name .


 No, it shouldn't be possible and that is your problem. How did you
 created that table?

 --
 Sylvain



 Thanks and Regards
 Kuldeep


 On Fri, Mar 15, 2013 at 4:05 PM, Kuldeep Mishra 
 kuld.cs.mis...@gmail.com wrote:


 Hi ,
 Following cql query not returning any result
 cqlsh:KunderaExamples select * from DOCTOR where key='kuldeep';

I have enabled secondary indexes on both column.

 Screen shot is attached

 Please help


 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199




 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





 --
 Thanks and Regards
 Kuldeep Kumar Mishra
 +919540965199





Re: CQL query issue

2013-03-05 Thread Vivek Mishra
Thank you, I was able to solve this one.
If I try:

SELECT * FROM CompositeUser WHERE userId='mevivs' LIMIT 100 ALLOW
FILTERING

it works. Somehow I got confused by
http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT, which states:

SELECT select_expression
  FROM keyspace_name.table_name
  WHERE clause AND clause ...
  ALLOW FILTERING
  LIMIT n
  ORDER BY compound_key_2 ASC | DESC

Is this an issue?

-Vivek



On Tue, Mar 5, 2013 at 5:21 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Hi,
 I am trying to execute a cql3 query as :

 SELECT * FROM CompositeUser WHERE userId='mevivs' ALLOW FILTERING
 LIMIT 100

 and getting given below error:

 Caused by: InvalidRequestException(why:line 1:70 missing EOF at 'LIMIT')
 at
 org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
  at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)


 Is there something incorrect in syntax?



Re: CQL query issue

2013-03-05 Thread Vivek Mishra
Somebody in the group, please confirm whether this is an issue or whether the
select syntax needs to be rectified.

-Vivek

On Tue, Mar 5, 2013 at 5:31 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Thank you, I was able to solve this one.
 If I try:

 SELECT * FROM CompositeUser WHERE userId='mevivs' LIMIT 100 ALLOW
 FILTERING

 it works. Somehow I got confused by
 http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT, which states:

 SELECT select_expression
   FROM keyspace_name.table_name
   WHERE clause AND clause ...
   ALLOW FILTERING
   LIMIT n
   ORDER BY compound_key_2 ASC | DESC

 Is this an issue?

 -Vivek



 On Tue, Mar 5, 2013 at 5:21 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Hi,
 I am trying to execute a cql3 query as :

 SELECT * FROM CompositeUser WHERE userId='mevivs' ALLOW FILTERING
 LIMIT 100

 and getting given below error:

 Caused by: InvalidRequestException(why:line 1:70 missing EOF at 'LIMIT')
 at
 org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
  at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)


 Is there something incorrect in syntax?





Re: CQL query issue

2013-03-05 Thread Sylvain Lebresne
This is not an issue of Cassandra. In particular
http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt is up to date.
It is an issue of the datastax documentation however. I'll see with them
that this gets resolved.
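
For reference, a small sketch with the Python cql db-api (the same driver used
elsewhere in this archive) running the ordering that the CQL3 grammar accepts,
i.e. LIMIT before ALLOW FILTERING; the keyspace name here is only a
placeholder, since the thread does not say where CompositeUser lives:

import cql

# Connect with an explicit CQL3 version; 'KunderaExamples' is a placeholder keyspace.
conn = cql.connect('localhost', 9160, 'KunderaExamples', cql_version='3.0.0')
cursor = conn.cursor()

# LIMIT comes before ALLOW FILTERING in the CQL3 grammar.
cursor.execute("SELECT * FROM CompositeUser WHERE userId='mevivs' "
               "LIMIT 100 ALLOW FILTERING")
for row in cursor.fetchall():
    print row

cursor.close()
conn.close()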


On Tue, Mar 5, 2013 at 3:26 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Somebody in the group, please confirm whether this is an issue or whether the
 select syntax needs to be rectified.

 -Vivek


 On Tue, Mar 5, 2013 at 5:31 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Thank you, I was able to solve this one.
 If I try:

 SELECT * FROM CompositeUser WHERE userId='mevivs' LIMIT 100 ALLOW
 FILTERING

 it works. Somehow I got confused by
 http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT, which states:

 SELECT select_expression
   FROM keyspace_name.table_name
   WHERE clause AND clause ...
   ALLOW FILTERING
   LIMIT n
   ORDER BY compound_key_2 ASC | DESC

 Is this an issue?

 -Vivek



 On Tue, Mar 5, 2013 at 5:21 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 Hi,
 I am trying to execute a cql3 query as :

 SELECT * FROM CompositeUser WHERE userId='mevivs' ALLOW FILTERING
 LIMIT 100

 and getting given below error:

 Caused by: InvalidRequestException(why:line 1:70 missing EOF at 'LIMIT')
 at
 org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37849)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
  at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)


 Is there something incorrect in syntax?






Re: Column Slice Query performance after deletions

2013-03-03 Thread aaron morton
 I need something to keep the deleted columns away from my query fetch. Not 
 only the tombstones.
 It looks like the min compaction might help on this. But I'm not sure yet on 
 what would be a reasonable value for its threeshold.
Your tombstones will not be purged in a compaction until after gc_grace, and
only if all fragments of the row are in the compaction. You're right that you
would probably want to run repair during the day if you are going to
dramatically reduce gc_grace, to avoid deleted data coming back to life.

If you are using a single cassandra row as a queue, you are going to have 
trouble. Levelled compaction may help a little. 

If you are reading the most recent entries in the row, and the columns
are sorted by some timestamp, use the Reverse Comparator and issue slice
commands to get the first X cols. That will remove tombstones from the problem.
(Am guessing this is not something you do, just mentioning it.)

Your next option is to change the data model so you don't use the same row all
day.

After that, consider a message queue. 
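
As an illustration of the reversed-slice suggestion, a minimal pycassa sketch;
the keyspace, column family and row key names are made up for the example, and
it assumes the column names sort chronologically:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')           # hypothetical keyspace
messages = pycassa.ColumnFamily(pool, 'messages')     # hypothetical wide-row CF

# e.g. one row per day instead of one row reused forever
row_key = 'queue_20130302'

# column_reversed=True walks the row from the highest column name downwards and
# stops after column_count, so the read starts at the newest columns instead of
# scanning past the older, deleted ones.
latest = messages.get(row_key, column_count=100, column_reversed=True)

for name, value in latest.items():
    print name, value

pool.dispose()

The day-bucketed row key in the sketch is one way of not reusing the same row
all day, as suggested above.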

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 2/03/2013, at 12:03 PM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com 
wrote:

 Tombstones stay around until gc grace so you could lower that to see of that 
 fixes the performance issues.
 
 If the tombstones get collected,the column will live again, causing data 
 inconsistency since I cant run a repair during the regular operations. Not 
 sure if I got your thoughts on this.
 
 
 Size tiered or leveled comparison?
 
 I'm actuallly running on Size Tiered Compaction, but I've been looking into 
 changing it for Leveled. It seems to be the case.  Although even if I achieve 
 some performance, I would still have the same problem with the deleted 
 columns.
 
 
 I need something to keep the deleted columns away from my query fetch. Not 
 only the tombstones.
 It looks like the min compaction might help on this. But I'm not sure yet on 
 what would be a reasonable value for its threeshold.
 
 
 On Sat, Mar 2, 2013 at 4:22 PM, Michael Kjellman mkjell...@barracuda.com 
 wrote:
 Tombstones stay around until gc grace so you could lower that to see of that 
 fixes the performance issues.
 
 Size tiered or leveled comparison?
 
 On Mar 2, 2013, at 11:15 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:
 
 What is your gc_grace set to? Sounds like as the number of tombstones 
 records increase your performance decreases. (Which I would expect)
 
 gr_grace is default.
 
 
 Casandra's data files are write once. Deletes are another write. Until 
 compaction they all live on disk.Making really big rows has these problem.
 Oh, so it looks like I should lower the min_compaction_threshold for this 
 column family. Right?
 What does realy mean this threeshold value?
 
 
 Guys, thanks for the help so far.
 
 On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman mkjell...@barracuda.com 
 wrote:
 What is your gc_grace set to? Sounds like as the number of tombstones 
 records increase your performance decreases. (Which I would expect)
 
 On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:
 
 I have a daily maintenance of my cluster where I truncate this column 
 family. Because its data doesnt need to be kept more than a day. 
 Since all the regular operations on it finishes around 4 hours before 
 finishing the day. I regurlarly run a truncate on it followed by a repair 
 at the end of the day.
 
 And every day, when the operations are started(when are only few deleted 
 columns), the performance looks pretty well.
 Unfortunately it is degraded along the day.
 
 
 On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman mkjell...@barracuda.com 
 wrote:
 When is the last time you did a cleanup on the cf?
 
 On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:
 
  Hello guys.
  I'm investigating the reasons of performance degradation for my case 
  scenario which follows:
 
  - I do have a column family which is filled of thousands of columns 
  inside a unique row(varies between 10k ~ 200k). And I do have also 
  thousands of rows, not much more than 15k.
  - This rows are constantly updated. But the write-load is not that 
  intensive. I estimate it as 100w/sec in the column family.
  - Each column represents a message which is read and processed by another 
  process. After reading it, the column is marked for deletion in order to 
  keep it out from the next query on this row.
 
  Ok, so, I've been figured out that after many insertions plus deletion 
  updates, my queries( column slice query ) are taking more time to be 
  performed. Even if there are only few columns, lower than 100.
 
  So it looks like that the longer is the number of columns being deleted, 
  the longer is the time spent for a query.
  - Internally at C*, does column slice query ranges among deleted

Column Slice Query performance after deletions

2013-03-02 Thread Víctor Hugo Oliveira Molinar
Hello guys.
I'm investigating the reasons for performance degradation in my scenario,
which is as follows:

- I have a column family where each row is filled with thousands of columns
(varying between 10k ~ 200k). I also have thousands of rows, not much more
than 15k.
- These rows are constantly updated, but the write load is not that
intensive; I estimate it at about 100 writes/sec on the column family.
- Each column represents a message which is read and processed by another
process. After reading it, the column is marked for deletion in order to
keep it out of the next query on this row.

Ok, so I've figured out that after many insertions plus deletion
updates, my queries (column slice queries) are taking more time to be
performed, even if only a few columns, fewer than 100, are left.

So it looks like the larger the number of deleted columns, the longer a
query takes.
- Internally in C*, does a column slice query range over deleted columns?
If so, how can I mitigate the impact on my queries? Or, how can I avoid
those deleted columns?


Re: Column Slice Query performance after deletions

2013-03-02 Thread Michael Kjellman
When is the last time you did a cleanup on the cf?

On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

 Hello guys.
 I'm investigating the reasons of performance degradation for my case scenario 
 which follows:
 
 - I do have a column family which is filled of thousands of columns inside a 
 unique row(varies between 10k ~ 200k). And I do have also thousands of rows, 
 not much more than 15k.
 - This rows are constantly updated. But the write-load is not that intensive. 
 I estimate it as 100w/sec in the column family.
 - Each column represents a message which is read and processed by another 
 process. After reading it, the column is marked for deletion in order to keep 
 it out from the next query on this row.
 
 Ok, so, I've been figured out that after many insertions plus deletion 
 updates, my queries( column slice query ) are taking more time to be 
 performed. Even if there are only few columns, lower than 100.
 
 So it looks like that the longer is the number of columns being deleted, the 
 longer is the time spent for a query.
 - Internally at C*, does column slice query ranges among deleted columns?
 If so, how can I mitigate the impact in my queries? Or, how can I avoid those 
 deleted columns?



Re: Column Slice Query performance after deletions

2013-03-02 Thread Víctor Hugo Oliveira Molinar
I have a daily maintenance job on my cluster where I truncate this column
family, because its data doesn't need to be kept for more than a day.
Since all the regular operations on it finish around 4 hours before the
end of the day, I regularly run a truncate on it followed by a repair
at the end of the day.

And every day, when the operations start (when there are only a few deleted
columns), the performance looks pretty good.
Unfortunately it degrades over the course of the day.


On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman mkjell...@barracuda.comwrote:

 When is the last time you did a cleanup on the cf?

 On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

  Hello guys.
  I'm investigating the reasons of performance degradation for my case
 scenario which follows:
 
  - I do have a column family which is filled of thousands of columns
 inside a unique row(varies between 10k ~ 200k). And I do have also
 thousands of rows, not much more than 15k.
  - This rows are constantly updated. But the write-load is not that
 intensive. I estimate it as 100w/sec in the column family.
  - Each column represents a message which is read and processed by
 another process. After reading it, the column is marked for deletion in
 order to keep it out from the next query on this row.
 
  Ok, so, I've been figured out that after many insertions plus deletion
 updates, my queries( column slice query ) are taking more time to be
 performed. Even if there are only few columns, lower than 100.
 
  So it looks like that the longer is the number of columns being deleted,
 the longer is the time spent for a query.
  - Internally at C*, does column slice query ranges among deleted
 columns?
  If so, how can I mitigate the impact in my queries? Or, how can I avoid
 those deleted columns?




Re: Column Slice Query performance after deletions

2013-03-02 Thread Michael Kjellman
What is your gc_grace set to? Sounds like as the number of tombstones records 
increase your performance decreases. (Which I would expect)

On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

I have a daily maintenance of my cluster where I truncate this column family. 
Because its data doesnt need to be kept more than a day.
Since all the regular operations on it finishes around 4 hours before finishing 
the day. I regurlarly run a truncate on it followed by a repair at the end of 
the day.

And every day, when the operations are started(when are only few deleted 
columns), the performance looks pretty well.
Unfortunately it is degraded along the day.


On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman 
mkjell...@barracuda.com wrote:
When is the last time you did a cleanup on the cf?

On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

 Hello guys.
 I'm investigating the reasons of performance degradation for my case scenario 
 which follows:

 - I do have a column family which is filled of thousands of columns inside a 
 unique row(varies between 10k ~ 200k). And I do have also thousands of rows, 
 not much more than 15k.
 - This rows are constantly updated. But the write-load is not that intensive. 
 I estimate it as 100w/sec in the column family.
 - Each column represents a message which is read and processed by another 
 process. After reading it, the column is marked for deletion in order to keep 
 it out from the next query on this row.

 Ok, so, I've been figured out that after many insertions plus deletion 
 updates, my queries( column slice query ) are taking more time to be 
 performed. Even if there are only few columns, lower than 100.

 So it looks like that the longer is the number of columns being deleted, the 
 longer is the time spent for a query.
 - Internally at C*, does column slice query ranges among deleted columns?
 If so, how can I mitigate the impact in my queries? Or, how can I avoid those 
 deleted columns?



Re: Column Slice Query performance after deletions

2013-03-02 Thread Edward Capriolo
Cassandra's data files are write-once. Deletes are another write. Until
compaction they all live on disk. Making really big rows has these problems.

On Sat, Mar 2, 2013 at 1:42 PM, Michael Kjellman mkjell...@barracuda.comwrote:

 What is your gc_grace set to? Sounds like as the number of tombstones
 records increase your performance decreases. (Which I would expect)

 On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 I have a daily maintenance of my cluster where I truncate this column
 family. Because its data doesnt need to be kept more than a day.
 Since all the regular operations on it finishes around 4 hours before
 finishing the day. I regurlarly run a truncate on it followed by a repair
 at the end of the day.

 And every day, when the operations are started(when are only few deleted
 columns), the performance looks pretty well.
 Unfortunately it is degraded along the day.


 On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman 
 mkjell...@barracuda.comwrote:

 When is the last time you did a cleanup on the cf?

 On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

  Hello guys.
  I'm investigating the reasons of performance degradation for my case
 scenario which follows:
 
  - I do have a column family which is filled of thousands of columns
 inside a unique row(varies between 10k ~ 200k). And I do have also
 thousands of rows, not much more than 15k.
  - This rows are constantly updated. But the write-load is not that
 intensive. I estimate it as 100w/sec in the column family.
  - Each column represents a message which is read and processed by
 another process. After reading it, the column is marked for deletion in
 order to keep it out from the next query on this row.
 
  Ok, so, I've been figured out that after many insertions plus deletion
 updates, my queries( column slice query ) are taking more time to be
 performed. Even if there are only few columns, lower than 100.
 
  So it looks like that the longer is the number of columns being
 deleted, the longer is the time spent for a query.
  - Internally at C*, does column slice query ranges among deleted
 columns?
  If so, how can I mitigate the impact in my queries? Or, how can I avoid
 those deleted columns?




Re: Column Slice Query performance after deletions

2013-03-02 Thread Víctor Hugo Oliveira Molinar
What is your gc_grace set to? Sounds like as the number of tombstones
records increase your performance decreases. (Which I would expect)


gc_grace is at the default.


Casandra's data files are write once. Deletes are another write. Until
compaction they all live on disk.Making really big rows has these problem.

Oh, so it looks like I should lower the min_compaction_threshold for this
column family. Right?
What does this threshold value really mean?


Guys, thanks for the help so far.

On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman mkjell...@barracuda.comwrote:

 What is your gc_grace set to? Sounds like as the number of tombstones
 records increase your performance decreases. (Which I would expect)

 On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 I have a daily maintenance of my cluster where I truncate this column
 family. Because its data doesnt need to be kept more than a day.
 Since all the regular operations on it finishes around 4 hours before
 finishing the day. I regurlarly run a truncate on it followed by a repair
 at the end of the day.

 And every day, when the operations are started(when are only few deleted
 columns), the performance looks pretty well.
 Unfortunately it is degraded along the day.


 On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman 
 mkjell...@barracuda.comwrote:

 When is the last time you did a cleanup on the cf?

 On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

  Hello guys.
  I'm investigating the reasons of performance degradation for my case
 scenario which follows:
 
  - I do have a column family which is filled of thousands of columns
 inside a unique row(varies between 10k ~ 200k). And I do have also
 thousands of rows, not much more than 15k.
  - This rows are constantly updated. But the write-load is not that
 intensive. I estimate it as 100w/sec in the column family.
  - Each column represents a message which is read and processed by
 another process. After reading it, the column is marked for deletion in
 order to keep it out from the next query on this row.
 
  Ok, so, I've been figured out that after many insertions plus deletion
 updates, my queries( column slice query ) are taking more time to be
 performed. Even if there are only few columns, lower than 100.
 
  So it looks like that the longer is the number of columns being
 deleted, the longer is the time spent for a query.
  - Internally at C*, does column slice query ranges among deleted
 columns?
  If so, how can I mitigate the impact in my queries? Or, how can I avoid
 those deleted columns?




Re: Column Slice Query performance after deletions

2013-03-02 Thread Michael Kjellman
Tombstones stay around until gc_grace, so you could lower that to see if that
fixes the performance issues.

Size tiered or leveled compaction?

On Mar 2, 2013, at 11:15 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

What is your gc_grace set to? Sounds like as the number of tombstones records 
increase your performance decreases. (Which I would expect)

gr_grace is default.


Casandra's data files are write once. Deletes are another write. Until 
compaction they all live on disk.Making really big rows has these problem.
Oh, so it looks like I should lower the min_compaction_threshold for this 
column family. Right?
What does realy mean this threeshold value?


Guys, thanks for the help so far.

On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman 
mkjell...@barracuda.com wrote:
What is your gc_grace set to? Sounds like as the number of tombstones records 
increase your performance decreases. (Which I would expect)

On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

I have a daily maintenance of my cluster where I truncate this column family. 
Because its data doesnt need to be kept more than a day.
Since all the regular operations on it finishes around 4 hours before finishing 
the day. I regurlarly run a truncate on it followed by a repair at the end of 
the day.

And every day, when the operations are started(when are only few deleted 
columns), the performance looks pretty well.
Unfortunately it is degraded along the day.


On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman 
mkjell...@barracuda.com wrote:
When is the last time you did a cleanup on the cf?

On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

 Hello guys.
 I'm investigating the reasons of performance degradation for my case scenario 
 which follows:

 - I do have a column family which is filled of thousands of columns inside a 
 unique row(varies between 10k ~ 200k). And I do have also thousands of rows, 
 not much more than 15k.
 - This rows are constantly updated. But the write-load is not that intensive. 
 I estimate it as 100w/sec in the column family.
 - Each column represents a message which is read and processed by another 
 process. After reading it, the column is marked for deletion in order to keep 
 it out from the next query on this row.

 Ok, so, I've been figured out that after many insertions plus deletion 
 updates, my queries( column slice query ) are taking more time to be 
 performed. Even if there are only few columns, lower than 100.

 So it looks like that the longer is the number of columns being deleted, the 
 longer is the time spent for a query.
 - Internally at C*, does column slice query ranges among deleted columns?
 If so, how can I mitigate the impact in my queries? Or, how can I avoid those 
 deleted columns?



Re: Column Slice Query performance after deletions

2013-03-02 Thread Víctor Hugo Oliveira Molinar
Tombstones stay around until gc_grace, so you could lower that to see if
that fixes the performance issues.

If the tombstones get collected, the column will live again, causing data
inconsistency, since I can't run a repair during the regular operations. Not
sure if I got your thoughts on this.


Size tiered or leveled compaction?


I'm actually running on Size Tiered Compaction, but I've been looking into
changing it for Leveled. It seems to be the case. Although even if I
achieve some performance gain, I would still have the same problem with the
deleted columns.


I need something to keep the deleted columns away from my query fetch, not
only the tombstones.
It looks like min compaction might help with this, but I'm not sure yet
what would be a reasonable value for its threshold.


On Sat, Mar 2, 2013 at 4:22 PM, Michael Kjellman mkjell...@barracuda.comwrote:

 Tombstones stay around until gc grace so you could lower that to see of
 that fixes the performance issues.

 Size tiered or leveled comparison?

 On Mar 2, 2013, at 11:15 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 What is your gc_grace set to? Sounds like as the number of tombstones
 records increase your performance decreases. (Which I would expect)


 gr_grace is default.


 Casandra's data files are write once. Deletes are another write. Until
 compaction they all live on disk.Making really big rows has these problem.

 Oh, so it looks like I should lower the min_compaction_threshold for this
 column family. Right?
 What does realy mean this threeshold value?


 Guys, thanks for the help so far.

 On Sat, Mar 2, 2013 at 3:42 PM, Michael Kjellman 
 mkjell...@barracuda.comwrote:

 What is your gc_grace set to? Sounds like as the number of tombstones
 records increase your performance decreases. (Which I would expect)

 On Mar 2, 2013, at 10:28 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 I have a daily maintenance of my cluster where I truncate this column
 family. Because its data doesnt need to be kept more than a day.
 Since all the regular operations on it finishes around 4 hours before
 finishing the day. I regurlarly run a truncate on it followed by a repair
 at the end of the day.

 And every day, when the operations are started(when are only few deleted
 columns), the performance looks pretty well.
 Unfortunately it is degraded along the day.


 On Sat, Mar 2, 2013 at 2:54 PM, Michael Kjellman mkjell...@barracuda.com
  wrote:

 When is the last time you did a cleanup on the cf?

 On Mar 2, 2013, at 9:48 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

  Hello guys.
  I'm investigating the reasons of performance degradation for my case
 scenario which follows:
 
  - I do have a column family which is filled of thousands of columns
 inside a unique row(varies between 10k ~ 200k). And I do have also
 thousands of rows, not much more than 15k.
  - This rows are constantly updated. But the write-load is not that
 intensive. I estimate it as 100w/sec in the column family.
  - Each column represents a message which is read and processed by
 another process. After reading it, the column is marked for deletion in
 order to keep it out from the next query on this row.
 
  Ok, so, I've been figured out that after many insertions plus deletion
 updates, my queries( column slice query ) are taking more time to be
 performed. Even if there are only few columns, lower than 100.
 
  So it looks like that the longer is the number of columns being
 deleted, the longer is the time spent for a query.
  - Internally at C*, does column slice query ranges among deleted
 columns?
  If so, how can I mitigate the impact in my queries? Or, how can I
 avoid those deleted columns?




Query data in a CF within a timestamp range

2013-02-28 Thread Kasun Weranga
Hi all,

I have a column family with some data + timestamp values and I want to
query the column family to fetch data within a timestamp range. AFAIK it is
not a good idea to use a secondary index on the timestamp due to its high
cardinality.

Is there a way to achieve this functionality?

Thanks,
Kasun.


Re: Query data in a CF within a timestamp range

2013-02-28 Thread Edward Capriolo
Pseudo code :

GregorianCalendar gc = new GregorianCalendar();
DateFormat df = new SimpleDateFormat("yyyyMMddhhmm");
String reversekey = df.format(gc.getTime());

set mycolumnfamily['myrow']['mycolumn'] = 'myvalue';
set myreverseindex['$reversekey']['myrow'] = '';

Under rapid insertion this makes hot-spots. There is not an easy way around
that other than sharding the reverse index.
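
A rough pycassa sketch of the same idea, with the reverse index sharded by a
small random suffix to spread the write hot-spot; all keyspace, column family
and key names are made up for illustration:

import time
import random
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')                 # hypothetical keyspace
data_cf = pycassa.ColumnFamily(pool, 'mycolumnfamily')      # hypothetical CFs
index_cf = pycassa.ColumnFamily(pool, 'myreverseindex')

NUM_SHARDS = 16

def insert(row_key, column, value):
    # 1. Write the data itself.
    data_cf.insert(row_key, {column: value})
    # 2. Write an index entry keyed by minute bucket + shard; the column name
    #    is the data row key, so slicing an index row lists the matching rows.
    minute_bucket = time.strftime('%Y%m%d%H%M')
    shard = random.randint(0, NUM_SHARDS - 1)
    index_cf.insert('%s:%d' % (minute_bucket, shard), {row_key: ''})

def rows_in_minute(minute_bucket):
    # Read every shard of one minute bucket and merge the row keys.
    keys = set()
    for shard in range(NUM_SHARDS):
        try:
            keys.update(index_cf.get('%s:%d' % (minute_bucket, shard)).keys())
        except pycassa.NotFoundException:
            pass
    return keys

insert('myrow', 'mycolumn', 'myvalue')
print rows_in_minute(time.strftime('%Y%m%d%H%M'))

Querying a time range then means walking the minute buckets between the two
timestamps and reading each bucket's shards.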


On Thu, Feb 28, 2013 at 5:49 PM, Kasun Weranga kas...@wso2.com wrote:
 Hi all,

 I have a column family with some data + timestamp values and I want to query
 the column family to fetch data within a timestamp range. AFAIK it is not
 better to use secondary index for timestamp due to high cardinality.

 Is there a way to achieve this functionality?

 Thanks,
 Kasun.


Re: How to limit query results like from row 50 to 100

2013-02-21 Thread aaron morton
CQL does not support offset but does have limit. 

See 
http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit
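
CQL has no OFFSET, so "rows 50 to 100" is usually emulated by paging: fetch a
page with LIMIT, remember the last partition key seen, and continue from
token(last_key). A rough sketch with the Python cql driver, reusing the users
table from another thread in this archive; the paging-by-token pattern itself
is a common approach and is not spelled out in this thread:

import cql

conn = cql.connect('localhost', 9160, 'demodb', cql_version='3.0.0')
cursor = conn.cursor()

page_size = 50
cursor.execute("SELECT user_name FROM users LIMIT %d" % page_size)
first_page = cursor.fetchall()               # rows 1..50, in token order

if first_page:
    last_key = first_page[-1][0]
    # Continue from where the first page stopped to get rows 51..100.
    cursor.execute(
        "SELECT user_name FROM users WHERE token(user_name) > token('%s') "
        "LIMIT %d" % (last_key, page_size))
    for row in cursor.fetchall():
        print row

cursor.close()
conn.close()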

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas 
mateus.ffrei...@hotmail.com wrote:

 With CQL or an API.



How to limit query results like from row 50 to 100

2013-02-19 Thread Mateus Ferreira e Freitas




With CQL or an API.   

Re: Secondary index query + 2 Datacenters + Row Cache + Restart = 0 rows

2013-02-05 Thread Alexei Bakanov
I tried to run with tracing, but it says 'Scanned 0 rows and matched 0'.
I found existing issue on this bug
https://issues.apache.org/jira/browse/CASSANDRA-4973
I made a d-test for reproducing it and attached to the ticket.

Alexei

On 2 February 2013 23:00, aaron morton aa...@thelastpickle.com wrote:
 Can you run the select in cqlsh and enabling tracing (see the cqlsh online
 help).

 If you can replicate it then place raise a ticket on
 https://issues.apache.org/jira/browse/CASSANDRA and update email thread.

 Thanks

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/02/2013, at 9:03 PM, Alexei Bakanov russ...@gmail.com wrote:

 Hello,

 I've found a combination that doesn't work:
 A column family that have a secondary index and caching='ALL' with
 data in two datacenters and I do a restart of the nodes, then my
 secondary index queries start returning 0 rows.
 It happens when amount of data goes over a certain threshold, so I
 suspect that compactions are involved in this as well.
 Taking out one of the ingredients fixes the problem and my queries
 return rows from secondary index.
 I suspect that this guy is struggling with the same thing
 https://issues.apache.org/jira/browse/CASSANDRA-4785

 Here is a sequence of actions that reproduces it with help of CCM:

 $ ccm create --cassandra-version 1.2.1 --nodes 2 -p RandomPartitioner
 testRowCacheDC
 $ ccm updateconf 'endpoint_snitch: PropertyFileSnitch'
 $ ccm updateconf 'row_cache_size_in_mb: 200'
 $ cp ~/Downloads/cassandra-topology.properties
 ~/.ccm/testRowCacheDC/node1/conf/  (please find .properties file
 below)
 $ cp ~/Downloads/cassandra-topology.properties
 ~/.ccm/testRowCacheDC/node2/conf/
 $ ccm start
 $ ccm cli
 -create keyspace and column family(please find schema below)
 $ python populate_rowcache.py
 $ ccm stop  (I tried flush first, doesn't help)
 $ ccm start
 $ ccm cli
 Connected to: testRowCacheDC on 127.0.0.1/9160
 Welcome to Cassandra CLI version 1.2.1-SNAPSHOT

 Type 'help;' or '?' for help.
 Type 'quit;' or 'exit;' to quit.

 [default@unknown] use testks;
 Authenticated to keyspace: testks
 [default@testks] get cf1 where 'indexedColumn'='userId_75';

 0 Row Returned.
 Elapsed time: 68 msec(s).

 My cassandra instances run with -Xms1927M -Xmx1927M -Xmn400M
 Thanks for help.

 Best regards,
 Alexei


 -- START cassandra-topology.properties --
 127.0.0.1=DC1:RAC1
 127.0.0.2=DC2:RAC1
 default=DC1:r1
 -- FINISH cassandra-topology.properties --

 -- START cassandra-cli schema ---
 create keyspace testks
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 1}
  and durable_writes = true;

 use testks;

 create column family cf1
  with column_type = 'Standard'
  and comparator = 'org.apache.cassandra.db.marshal.AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 1.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'ALL'
  and column_metadata = [
{column_name : 'indexedColumn',
validation_class : UTF8Type,
index_name : 'INDEX1',
index_type : 0}]
  and compression_options = {'sstable_compression' :
 'org.apache.cassandra.io.compress.SnappyCompressor'};
 ---FINISH cassandra-cli schema ---

 -- START populate_rowcache.py ---
 from pycassa.batch import Mutator

 import pycassa

 pool = pycassa.ConnectionPool('testks', timeout=5)
 cf = pycassa.ColumnFamily(pool, 'cf1')

 for userId in xrange(0, 1000):
     print userId
     b = Mutator(pool, queue_size=200)
     for itemId in xrange(20):
         rowKey = 'userId_%s:itemId_%s' % (userId, itemId)
         for message_number in xrange(10):
             b.insert(cf, rowKey, {'indexedColumn': 'userId_%s' % userId,
                                   str(message_number): str(message_number)})
     b.send()

 pool.dispose()
 -- FINISH populate_rowcache.py ---




Re: Secondary index query + 2 Datacenters + Row Cache + Restart = 0 rows

2013-02-02 Thread aaron morton
Can you run the select in cqlsh and enabling tracing (see the cqlsh online 
help). 

If you can replicate it then place raise a ticket on 
https://issues.apache.org/jira/browse/CASSANDRA and update email thread. 

Thanks

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/02/2013, at 9:03 PM, Alexei Bakanov russ...@gmail.com wrote:

 Hello,
 
 I've found a combination that doesn't work:
 A column family that have a secondary index and caching='ALL' with
 data in two datacenters and I do a restart of the nodes, then my
 secondary index queries start returning 0 rows.
 It happens when amount of data goes over a certain threshold, so I
 suspect that compactions are involved in this as well.
 Taking out one of the ingredients fixes the problem and my queries
 return rows from secondary index.
 I suspect that this guy is struggling with the same thing
 https://issues.apache.org/jira/browse/CASSANDRA-4785
 
 Here is a sequence of actions that reproduces it with help of CCM:
 
 $ ccm create --cassandra-version 1.2.1 --nodes 2 -p RandomPartitioner
 testRowCacheDC
 $ ccm updateconf 'endpoint_snitch: PropertyFileSnitch'
 $ ccm updateconf 'row_cache_size_in_mb: 200'
 $ cp ~/Downloads/cassandra-topology.properties
 ~/.ccm/testRowCacheDC/node1/conf/  (please find .properties file
 below)
 $ cp ~/Downloads/cassandra-topology.properties 
 ~/.ccm/testRowCacheDC/node2/conf/
 $ ccm start
 $ ccm cli
 -create keyspace and column family(please find schema below)
 $ python populate_rowcache.py
 $ ccm stop  (I tried flush first, doesn't help)
 $ ccm start
 $ ccm cli
 Connected to: testRowCacheDC on 127.0.0.1/9160
 Welcome to Cassandra CLI version 1.2.1-SNAPSHOT
 
 Type 'help;' or '?' for help.
 Type 'quit;' or 'exit;' to quit.
 
 [default@unknown] use testks;
 Authenticated to keyspace: testks
 [default@testks] get cf1 where 'indexedColumn'='userId_75';
 
 0 Row Returned.
 Elapsed time: 68 msec(s).
 
 My cassandra instances run with -Xms1927M -Xmx1927M -Xmn400M
 Thanks for help.
 
 Best regards,
 Alexei
 
 
 -- START cassandra-topology.properties --
 127.0.0.1=DC1:RAC1
 127.0.0.2=DC2:RAC1
 default=DC1:r1
 -- FINISH cassandra-topology.properties --
 
 -- START cassandra-cli schema ---
 create keyspace testks
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 1}
  and durable_writes = true;
 
 use testks;
 
 create column family cf1
  with column_type = 'Standard'
  and comparator = 'org.apache.cassandra.db.marshal.AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 1.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'ALL'
  and column_metadata = [
{column_name : 'indexedColumn',
validation_class : UTF8Type,
index_name : 'INDEX1',
index_type : 0}]
  and compression_options = {'sstable_compression' :
 'org.apache.cassandra.io.compress.SnappyCompressor'};
 ---FINISH cassandra-cli schema ---
 
 -- START populate_rowcache.py ---
 from pycassa.batch import Mutator
 
 import pycassa
 
 pool = pycassa.ConnectionPool('testks', timeout=5)
 cf = pycassa.ColumnFamily(pool, 'cf1')
 
 for userId in xrange(0, 1000):
     print userId
     b = Mutator(pool, queue_size=200)
     for itemId in xrange(20):
         rowKey = 'userId_%s:itemId_%s' % (userId, itemId)
         for message_number in xrange(10):
             b.insert(cf, rowKey, {'indexedColumn': 'userId_%s' % userId,
                                   str(message_number): str(message_number)})
     b.send()
 
 pool.dispose()
 -- FINISH populate_rowcache.py ---



Secondary index query + 2 Datacenters + Row Cache + Restart = 0 rows

2013-02-01 Thread Alexei Bakanov
Hello,

I've found a combination that doesn't work:
a column family that has a secondary index and caching='ALL', with
data in two datacenters; when I do a restart of the nodes, my
secondary index queries start returning 0 rows.
It happens when the amount of data goes over a certain threshold, so I
suspect that compactions are involved in this as well.
Taking out one of the ingredients fixes the problem and my queries
return rows from the secondary index.
I suspect that this guy is struggling with the same thing:
https://issues.apache.org/jira/browse/CASSANDRA-4785

Here is a sequence of actions that reproduces it with help of CCM:

$ ccm create --cassandra-version 1.2.1 --nodes 2 -p RandomPartitioner
testRowCacheDC
$ ccm updateconf 'endpoint_snitch: PropertyFileSnitch'
$ ccm updateconf 'row_cache_size_in_mb: 200'
$ cp ~/Downloads/cassandra-topology.properties
~/.ccm/testRowCacheDC/node1/conf/  (please find .properties file
below)
$ cp ~/Downloads/cassandra-topology.properties ~/.ccm/testRowCacheDC/node2/conf/
$ ccm start
$ ccm cli
 -create keyspace and column family(please find schema below)
$ python populate_rowcache.py
$ ccm stop  (I tried flush first, doesn't help)
$ ccm start
$ ccm cli
Connected to: testRowCacheDC on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.1-SNAPSHOT

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use testks;
Authenticated to keyspace: testks
[default@testks] get cf1 where 'indexedColumn'='userId_75';

0 Row Returned.
Elapsed time: 68 msec(s).

My cassandra instances run with -Xms1927M -Xmx1927M -Xmn400M
Thanks for help.

Best regards,
Alexei


-- START cassandra-topology.properties --
127.0.0.1=DC1:RAC1
127.0.0.2=DC2:RAC1
default=DC1:r1
-- FINISH cassandra-topology.properties --

-- START cassandra-cli schema ---
create keyspace testks
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 1}
  and durable_writes = true;

use testks;

create column family cf1
  with column_type = 'Standard'
  and comparator = 'org.apache.cassandra.db.marshal.AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 1.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'ALL'
  and column_metadata = [
{column_name : 'indexedColumn',
validation_class : UTF8Type,
index_name : 'INDEX1',
index_type : 0}]
  and compression_options = {'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};
---FINISH cassandra-cli schema ---

-- START populate_rowcache.py ---
from pycassa.batch import Mutator

import pycassa

pool = pycassa.ConnectionPool('testks', timeout=5)
cf = pycassa.ColumnFamily(pool, 'cf1')

for userId in xrange(0, 1000):
    print userId
    b = Mutator(pool, queue_size=200)
    for itemId in xrange(20):
        rowKey = 'userId_%s:itemId_%s' % (userId, itemId)
        for message_number in xrange(10):
            b.insert(cf, rowKey, {'indexedColumn': 'userId_%s' % userId,
                                  str(message_number): str(message_number)})
    b.send()

pool.dispose()
-- FINISH populate_rowcache.py ---


Re: Performing simple CQL Query using python db-api 2.0 fails

2013-01-24 Thread aaron morton
How did you create the table? 

Anyways that looks like a bug, I *think* they should go here 
http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/issues/list

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 24/01/2013, at 7:14 AM, Paul van Hoven paul.van.ho...@googlemail.com wrote:

 I try to access my local cassandra database via python. Therefore I
 installed db-api 2.0 and thrift for accessing the database. Opening
 and closing a connection works fine. But a simply query is not
 working:
 
 The script looks like this:
 
c = conn.cursor()
c.execute(select * from users;)
data = c.fetchall()
print Query: select * from users; returned the following result:
print str(data)
 
 
 The table users looks like this:
 qlsh:demodb select * from users;
 
 user_name | birth_year | gender | password | session_token | state
 ---+++--+---+---
jsmith |   null |   null |   secret |  null |  null
 
 
 
 But when I try to execute it I get the following error:
 Open connection to localhost:9160 on keyspace demodb
 Traceback (most recent call last):
  File 
 /Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
 line 56, in module
perfromSimpleCQLQuery()
  File 
 /Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
 line 46, in perfromSimpleCQLQuery
c.execute(select * from users;)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 81, in execute
return self.process_execution_results(response, decoder=decoder)
  File /Library/Python/2.7/site-packages/cql/thrifteries.py, line
 116, in process_execution_results
self.get_metadata_info(self.result[0])
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 97, in
 get_metadata_info
name, nbytes, vtype, ctype = self.get_column_metadata(colid)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 104, in
 get_column_metadata
return self.decoder.decode_metadata_and_type(column_id)
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 45,
 in decode_metadata_and_type
name = self.name_decode_error(e, namebytes,
 comptype.cql_parameterized_type())
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 29,
 in name_decode_error
% (namebytes, expectedtype, err))
 cql.apivalues.ProgrammingError: column name '\x00\x00\x00' can't be
 deserialized as 'org.apache.cassandra.db.marshal.CompositeType':
 global name 'self' is not defined
 
 I'm not shure if this is the right place to ask for: But am I doing
 here something wrong?



Re: Performing simple CQL Query using python db-api 2.0 fails

2013-01-24 Thread Paul van Hoven
The reason for the error was that I opened the connection to the database wrong.

I did:
con = cql.connect(host, port, keyspace)

but correct is:
con = cql.connect(host, port, keyspace, cql_version='3.0.0')

Now it works fine. Thanks for reading.
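
Putting the fix together, a minimal end-to-end sketch of the working
connection and query, with the host, port and keyspace used earlier in the
thread:

import cql

# Connect with an explicit CQL3 version, as described above.
conn = cql.connect('localhost', 9160, 'demodb', cql_version='3.0.0')
cursor = conn.cursor()

cursor.execute("SELECT * FROM users")
for row in cursor.fetchall():
    print row

cursor.close()
conn.close()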

2013/1/24 aaron morton aa...@thelastpickle.com:
 How did you create the table?

 Anyways that looks like a bug, I *think* they should go here
 http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/issues/list

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 24/01/2013, at 7:14 AM, Paul van Hoven paul.van.ho...@googlemail.com
 wrote:

 I try to access my local cassandra database via python. Therefore I
 installed db-api 2.0 and thrift for accessing the database. Opening
 and closing a connection works fine. But a simply query is not
 working:

 The script looks like this:

c = conn.cursor()
c.execute(select * from users;)
data = c.fetchall()
print Query: select * from users; returned the following result:
print str(data)


 The table users looks like this:
 qlsh:demodb select * from users;

 user_name | birth_year | gender | password | session_token | state
 ---+++--+---+---
jsmith |   null |   null |   secret |  null |  null



 But when I try to execute it I get the following error:
 Open connection to localhost:9160 on keyspace demodb
 Traceback (most recent call last):
  File
 /Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
 line 56, in module
perfromSimpleCQLQuery()
  File
 /Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
 line 46, in perfromSimpleCQLQuery
c.execute(select * from users;)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 81, in execute
return self.process_execution_results(response, decoder=decoder)
  File /Library/Python/2.7/site-packages/cql/thrifteries.py, line
 116, in process_execution_results
self.get_metadata_info(self.result[0])
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 97, in
 get_metadata_info
name, nbytes, vtype, ctype = self.get_column_metadata(colid)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 104, in
 get_column_metadata
return self.decoder.decode_metadata_and_type(column_id)
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 45,
 in decode_metadata_and_type
name = self.name_decode_error(e, namebytes,
 comptype.cql_parameterized_type())
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 29,
 in name_decode_error
% (namebytes, expectedtype, err))
 cql.apivalues.ProgrammingError: column name '\x00\x00\x00' can't be
 deserialized as 'org.apache.cassandra.db.marshal.CompositeType':
 global name 'self' is not defined

 I'm not shure if this is the right place to ask for: But am I doing
 here something wrong?




Performing simple CQL Query using python db-api 2.0 fails

2013-01-23 Thread Paul van Hoven
I am trying to access my local cassandra database via python. Therefore I
installed db-api 2.0 and thrift for accessing the database. Opening
and closing a connection works fine. But a simple query is not
working:

The script looks like this:

c = conn.cursor()
c.execute(select * from users;)
data = c.fetchall()
print Query: select * from users; returned the following result:
print str(data)


The table users looks like this:
cqlsh:demodb> select * from users;

 user_name | birth_year | gender | password | session_token | state
-----------+------------+--------+----------+---------------+-------
    jsmith |       null |   null |   secret |          null |  null



But when I try to execute it I get the following error:
Open connection to localhost:9160 on keyspace demodb
Traceback (most recent call last):
  File 
/Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
line 56, in module
perfromSimpleCQLQuery()
  File 
/Users/Tom/Freelancing/Company/Python/ApacheCassandra/src/CassandraDemo.py,
line 46, in perfromSimpleCQLQuery
c.execute(select * from users;)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 81, in execute
return self.process_execution_results(response, decoder=decoder)
  File /Library/Python/2.7/site-packages/cql/thrifteries.py, line
116, in process_execution_results
self.get_metadata_info(self.result[0])
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 97, in
get_metadata_info
name, nbytes, vtype, ctype = self.get_column_metadata(colid)
  File /Library/Python/2.7/site-packages/cql/cursor.py, line 104, in
get_column_metadata
return self.decoder.decode_metadata_and_type(column_id)
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 45,
in decode_metadata_and_type
name = self.name_decode_error(e, namebytes,
comptype.cql_parameterized_type())
  File /Library/Python/2.7/site-packages/cql/decoders.py, line 29,
in name_decode_error
% (namebytes, expectedtype, err))
cql.apivalues.ProgrammingError: column name '\x00\x00\x00' can't be
deserialized as 'org.apache.cassandra.db.marshal.CompositeType':
global name 'self' is not defined

I'm not sure if this is the right place to ask, but am I doing
something wrong here?


Composite Keys Query

2013-01-17 Thread Renato Marroquín Mogrovejo
Hi all,

I am using some composite keys to get just some specific composite
columns names which I am using as follows:

create column family video_event
  with comparator = 'CompositeType(UTF8Type,UTF8Type)'
  and key_validation_class = 'UTF8Type'
  and default_validation_class = 'UTF8Type';
  column_metadata =
  [
{column_name: event, validation_class: UTF8Type}
  ];

The data it contains is as follows:

RowKey: otaner9902:94dd885a-655f-4d5c-adaf-db6a6c51d4ac
= (column=start:2013-01-17 13:31:05.072, value=, timestamp=1358447465294000)
= (column=stop:2013-01-17 13:31:05.402, value=2013-01-17
13:31:05.402, timestamp=1358447465402000)

And I am using the following code to retrieve the data:

Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
Composite end = compositeFrom(startArg,
Composite.ComponentEquality.GREATER_THAN_EQUAL);

 VideoEventCompositeQueryIterator iter =
new VideoEventCompositeQueryIterator(ALL, start,
end, keyspace);

The thing is that I keep on getting zero columns back, and I am really
getting to the point where it's driving me crazy. I have uploaded all my code
to my github account [1] and this specific class is in [2].
Any pointers are more than welcome! Thanks in advance!


Renato M.

[1] https://github.com/renato2099/cassandra12-video-app-hector
[2]https://github.com/renato2099/cassandra12-video-app-hector/blob/master/src/com/killrvideo/BusinessLogic.java


Re: Query column names

2013-01-16 Thread Renato Marroquín Mogrovejo
What I mean is: is there a way of doing the following, but using Hector:


-
public static void main(String[] args) throws Exception {
Connector conn = new Connector();
Cassandra.Client client = conn.connect();

SlicePredicate predicate = new SlicePredicate();
List<byte[]> colNames = new ArrayList<byte[]>();
colNames.add("a".getBytes());
colNames.add("b".getBytes());
predicate.column_names = colNames;

ColumnParent parent = new ColumnParent("Standard1");

byte[] key = "k1".getBytes();
List<ColumnOrSuperColumn> results =
client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);

for (ColumnOrSuperColumn cosc : results) {
    Column c = cosc.column;
    System.out.println(new String(c.name, "UTF-8") + " : "
        + new String(c.value, "UTF-8"));
}

conn.close();

System.out.println(All done.);
}
-


Thanks!

2013/1/16 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
 Hi,

 I am facing some problems while retrieving some events from a column
 family. I am using as the column name the event name plus the
 timestamp of when it occurred.
 The thing is that now I want to find out the latest event and I don't
 know how to query for the last event without a RangeSlicesQuery,
 getting all rows and columns, and checking one by one.
 Is there any better way of doing this using the Hector client?

 [default@clickstream] list click_event;
 ---
 RowKey: 
 706d63666164696e3a31396132613664322d633730642d343139362d623638642d396663663638343766333563
 = (column=start:2013-01-13 18:14:59.244, value=, timestamp=1358118943979000)
 = (column=stop:2013-01-13 18:15:56.793,
 value=323031332d30312d31332031383a31353a35382e333437,
 timestamp=1358118960946000)

 Thanks in advance!


 Renato M.


Re: Query column names

2013-01-16 Thread Renato Marroquín Mogrovejo
After searching for a while I found what I was looking for [1].
Hope it helps someone else (:


Renato M.

[1] http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
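
For the original question (getting just the latest event without a
RangeSlicesQuery), another option alongside the composite-column approach in
[1] is a reversed slice of the row, sketched here with pycassa; it assumes the
comparator orders columns so that the newest event sorts last (e.g. a TimeUUID
or timestamp-leading composite), and Hector's slice queries support a reversed
range in the same way. The row key below is a placeholder for the real key
shown in the list output:

import pycassa

pool = pycassa.ConnectionPool('clickstream')
click_event = pycassa.ColumnFamily(pool, 'click_event')

row_key = 'some_user:some_session_uuid'   # placeholder row key

# column_reversed=True starts from the highest column name; column_count=1
# returns only the latest column instead of the whole row.
latest = click_event.get(row_key, column_count=1, column_reversed=True)
print latest

pool.dispose()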

2013/1/16 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
 What I mean is that if there is a way of doing this but using Hector:


 -
 public static void main(String[] args) throws Exception {
 Connector conn = new Connector();
 Cassandra.Client client = conn.connect();

 SlicePredicate predicate = new SlicePredicate();
 List<byte[]> colNames = new ArrayList<byte[]>();
 colNames.add("a".getBytes());
 colNames.add("b".getBytes());
 predicate.column_names = colNames;

 ColumnParent parent = new ColumnParent("Standard1");

 byte[] key = "k1".getBytes();
 List<ColumnOrSuperColumn> results =
 client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);

 for (ColumnOrSuperColumn cosc : results) {
     Column c = cosc.column;
     System.out.println(new String(c.name, "UTF-8") + " : "
         + new String(c.value, "UTF-8"));
 }

 conn.close();

 System.out.println(All done.);
 }
 -


 Thanks!

 2013/1/16 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com:
 Hi,

 I am facing some problems while retrieving a some events from a column
 family. I am using as column family name the event name plus the
 timestamp of when it occurred.
 The thing is that now I want to find out the latest event and I don't
 how to query asking for the last event without a RangeSlicesQuery,
 getting all rows, and columns, and asking one by one.
 Is there any other better way of doing this using Hector client?

 [default@clickstream] list click_event;
 ---
 RowKey: 
 706d63666164696e3a31396132613664322d633730642d343139362d623638642d396663663638343766333563
 = (column=start:2013-01-13 18:14:59.244, value=, timestamp=1358118943979000)
 = (column=stop:2013-01-13 18:15:56.793,
 value=323031332d30312d31332031383a31353a35382e333437,
 timestamp=1358118960946000)

 Thanks in advance!


 Renato M.


Re: Collecting of tombstones columns during read query fills up heap

2013-01-14 Thread aaron morton
 Just so I understand, the file contents are *not* stored in the column value 
 ?
 
 No, on that particular CF the columns are SuperColumns with 5 sub columns 
 (size, is_dir, hash, name, revision). Each super column is small, I didn't 
 mention super columns before because they don't seem to be related to the 
 problem at hand.
I strongly recommend you not use super columns. While they are still supported 
they are not accessible via CQL and perform poorly compared to standard CF's. 

Still confused, are the contents of a file stored in cassandra? If you are
storing the contents in Cassandra, separating them from the meta data will make
the operations on the meta data much faster.

 Millions. I have dumped the SSTables to JSON, but have yet to figure out a 
 way to parse and obtain more information like an exact number since the files 
 are so big.
That's too many. 
You'll need to look at some of the previous suggestion for schema or compaction 
strategy changes. 

 For example, using your suggestion, it would imply some sort of 
 synchronisation whenever an operation decided that a new row should be 
 written otherwise I could lose some updates. 
I think you can do it without synchronisation or ZK. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/01/2013, at 12:10 PM, André Cruz andre.c...@co.sapo.pt wrote:

 On Jan 10, 2013, at 8:01 PM, aaron morton aa...@thelastpickle.com wrote:
 
  So, one column represents a file in that directory and it has no value.
 Just so I understand, the file contents are *not* stored in the column value 
 ?
 
 No, on that particular CF the columns are SuperColumns with 5 sub columns 
 (size, is_dir, hash, name, revision). Each super column is small, I didn't 
 mention super columns before because they don't seem to be related to the 
 problem at hand.
 
 Basically the heap fills up and if several queries happens simultaneously, 
 the heap is exhausted and the node stops.
 Are you seeing the GCInspector log messages ? Are they ParNew or CMS 
 compactions?
 If you want to get more insight into what the JVM is doing enable the GC 
 logging options in cassandra-env.sh. 
 
 I see a lot of messages regarding SliceQueryFilter:
 
 DEBUG [ReadStage:53] 2013-01-08 18:08:36,451 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(edbc633e-3f09-11e2-8f7d-e0db55018fa4 -delete 
 at 1357508861622915- 
 [hash:false:36@1354732265022159,is_dir:false:1@1354732265022159,mtime:false:4@1354732265022159,name:false:57@1354732265022159,revision:false:16@1354732265022159,])
 DEBUG [ReadStage:62] 2013-01-08 18:08:36,467 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(75869f16-3f0d-11e2-a935-e0db550199f4 -delete 
 at 1357543298946499- 
 [hash:false:36@1354733781339045,is_dir:false:1@1354733781339045,mtime:false:4@1354733781339045,name:false:56@1354733781339045,revision:false:16@1354733781339045,])
 DEBUG [ReadStage:64] 2013-01-08 18:08:36,449 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(6b3323de-3f0a-11e2-93b7-e0db55018fa4 -delete 
 at 1357543981711099- 
 [hash:false:36@1354732475524213,is_dir:false:1@1354732475524213,mtime:false:4@1354732475524213,name:false:56@1354732475524213,revision:false:16@1354732475524213,])
 DEBUG [ReadStage:51] 2013-01-08 18:08:36,448 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(2e2ccb66-3f0f-11e2-9f34-e0db5501ca40 -delete 
 at 1357548656930340- 
 [hash:false:36@1354734520625161,is_dir:false:1@1354734520625161,mtime:false:4@1354734520625161,name:false:54@1354734520625161,revision:false:16@1354734520625161,])
 DEBUG [ReadStage:62] 2013-01-08 18:08:36,468 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(758c5f3c-3f0d-11e2-a935-e0db550199f4 -delete 
 at 1357543303722497- 
 [hash:false:36@1354733781376479,is_dir:false:1@1354733781376479,mtime:false:4@1354733781376479,name:false:56@1354733781376479,revision:false:16@1354733781376479,])
 DEBUG [ReadStage:61] 2013-01-08 18:08:36,447 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(be15520e-3f08-11e2-843b-e0db550199f4 -delete 
 at 1357508704355097- 
 [hash:false:36@1354731755577230,is_dir:false:1@1354731755577230,mtime:false:4@1354731755577230,name:false:57@1354731755577230,revision:false:16@1354731755577230,])
 DEBUG [ReadStage:52] 2013-01-08 18:08:36,446 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(463b877e-3f0a-11e2-b990-e0db55018fa4 -delete 
 at 1357543038078223- 
 [hash:false:36@1354732413504338,is_dir:false:1@1354732413504338,mtime:false:4@1354732413504338,name:false:57@1354732413504338,revision:false:16@1354732413504338,])
 DEBUG [ReadStage:52] 2013-01-08 18:08:36,471 SliceQueryFilter.java (line 124) 
 collecting 1 of 102: SuperColumn(463ef5c6-3f0a-11e2-b990-e0db55018fa4 -delete 
 at 1357543038078223- 
 

Collecting of tombstones columns during read query fills up heap

2013-01-10 Thread André Cruz
Hello.

I have a schema to represent a filesystem for my users. In this schema one of 
the CF stores a directory listing this way:

CF DirList

   Dir1:
 File1:NOVAL File2:NOVAL ...

So, one column represents a file in that directory and it has no value. The 
file metadata is stored elsewhere. When listing the contents of a directory I 
fetch the row contents in batches (using pycassa's column_count and 
column_start) and always limit the number of columns that I want returned, so 
as not to occupy too much memory on the Cassandra server. However, if a certain 
user has deleted a lot of files in that dir and so has a lot of tombstones, 
even fetching with a column_count of 2 can pose problems to the Cassandra 
server. Basically the heap fills up and if several queries happens 
simultaneously, the heap is exhausted and the node stops. Dumping the SSTables 
shows that there were a lot of tombstones between those 2 columns.

Is there anything, other than schema changes or throttling on the application 
side, that I can do to prevent problems like these? Basically I would like 
Cassandra to stop a query if the resultset already has X items whether they are 
tombstones or not, and return an error. Or maybe it can stop if the resultset 
already occupies more than Y bytes or the heap is almost full. Some safety 
valve to prevent a DoS.

I should point out that I am using 1.1.5, but I have not seen anything in the 
changelog of this or more recent releases that references this issue. Normally I run 
with an 8GB heap and have no problems, but problematic queries can fill up the 
heap even if I bump it up to 24GB. The machines have 32GB.

Of course, the problem goes away after gc_grace_seconds pass and I run a manual 
compact on that CF, the tombstones are removed and queries to that row are 
efficient again.

Thanks,
André Cruz

Re: Collecting of tombstones columns during read query fills up heap

2013-01-10 Thread aaron morton
  So, one column represents a file in that directory and it has no value.
Just so I understand, the file contents are *not* stored in the column value ?

 Basically the heap fills up and if several queries happens simultaneously, 
 the heap is exhausted and the node stops.
Are you seeing the GCInspector log messages ? Are they ParNew or CMS 
compactions?
If you want to get more insight into what the JVM is doing enable the GC 
logging options in cassandra-env.sh. 

 Dumping the SSTables shows that there were a lot of tombstones between those 
 2 columns.
How many is a lot ?

  Normally I run with a 8GB heap and have no problems, but problematic queries 
 can fill up the heap even if I bump it up to 24GB. The machines have 32GB.
For queries like this it's (usually) not the overall size of the JVM heap, Xmx.
It's the size of the NEW_HEAP (in cassandra-env.sh) which sets Xmn. And the 
other new heap settings, SurvivorRatio and MaxTenuringThreshold. What settings 
do you have for those ?

 Of course, the problem goes away after gc_grace_seconds pass and I run a 
 manual compact on that CF, the tombstones are removed and queries to that row 
 are efficient again.
If you have a CF that has a high number of overwrites or deletions, using 
Levelled Compaction can help. It does use up some more IO than size-tiered but 
it's designed for these sorts of situations. See 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra and 
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
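
As a rough illustration, a sketch of what switching a table to levelled compaction looks like, using the CQL3 map syntax of later releases; the table name here is hypothetical, and a super column family like this one would have to be altered through cassandra-cli rather than CQL:

-- hypothetical table name; LCS keeps sstables in small, non-overlapping levels,
-- so overwritten and deleted columns get re-compacted away much sooner
ALTER TABLE dir_list
    WITH compaction = { 'class' : 'LeveledCompactionStrategy',
                        'sstable_size_in_mb' : 160 };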

Schema wise, you could try have multiple directory rows for each user. At 
certain times you can create a new row, which then receives all the writes. But 
you read (and delete if necessary) from all rows. Then migrate the data from 
the old rows to the new one and remove the old row.
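
A minimal sketch of that idea, assuming a hypothetical CQL3 layout (names are illustrative, not the existing super column schema): writes always go to the current bucket, reads and deletes fan out across the buckets created so far, and old buckets are migrated and dropped in the background.

-- one partition per (directory, bucket); a new bucket is started periodically
CREATE TABLE dir_listing (
    dir_id    uuid,
    bucket    int,
    file_name text,
    PRIMARY KEY ((dir_id, bucket), file_name)
);

-- new entries are written only to the current bucket
INSERT INTO dir_listing (dir_id, bucket, file_name)
VALUES (123e4567-e89b-12d3-a456-426655440000, 3, 'report.pdf');

-- listings read from every bucket that has been created
SELECT file_name FROM dir_listing
WHERE dir_id = 123e4567-e89b-12d3-a456-426655440000 AND bucket IN (0, 1, 2, 3);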

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/01/2013, at 12:37 AM, André Cruz andre.c...@co.sapo.pt wrote:

 Hello.
 
 I have a schema to represent a filesystem for my users. In this schema one of 
 the CF stores a directory listing this way:
 
 CF DirList
 
   Dir1:
 File1:NOVAL File2:NOVAL ...
 
 So, one column represents a file in that directory and it has no value. The 
 file metadata is stored elsewhere. When listing the contents of a directory I 
 fetch the row contents in batches (using pycassa's column_count and 
 column_start) and always limit the number of columns that I want returned, so 
 as not to occupy too much memory on the Cassandra server. However, if a 
 certain user has deleted a lot of files in that dir and so has a lot of 
 tombstones, even fetching with a column_count of 2 can pose problems to the 
 Cassandra server. Basically the heap fills up and if several queries happens 
 simultaneously, the heap is exhausted and the node stops. Dumping the 
 SSTables shows that there were a lot of tombstones between those 2 columns.
 
 Is there anything, other than schema changes or throttling on the application 
 side, that I can do to prevent problems like these? Basically I would like 
 Cassandra to stop a query if the resultset already has X items whether they 
 are tombstones or not, and return an error. Or maybe it can stop if the 
 resultset already occupies more than Y bytes or the heap is almost full. Some 
 safety valve to prevent a DoS.
 
 I should point out that I am using 1.1.5, but I have not seen anything in the 
 changelog of this or more recent releases that references this issue. Normally I 
 run with an 8GB heap and have no problems, but problematic queries can fill up 
 the heap even if I bump it up to 24GB. The machines have 32GB.
 
 Of course, the problem goes away after gc_grace_seconds pass and I run a 
 manual compact on that CF, the tombstones are removed and queries to that row 
 are efficient again.
 
 Thanks,
 André Cruz



Re: Collecting of tombstones columns during read query fills up heap

2013-01-10 Thread André Cruz
On Jan 10, 2013, at 8:01 PM, aaron morton aa...@thelastpickle.com wrote:

  So, one column represents a file in that directory and it has no value.
 Just so I understand, the file contents are *not* stored in the column value ?

No, on that particular CF the columns are SuperColumns with 5 sub columns 
(size, is_dir, hash, name, revision). Each super column is small, I didn't 
mention super columns before because they don't seem to be related to the 
problem at hand.

 Basically the heap fills up and if several queries happens simultaneously, 
 the heap is exhausted and the node stops.
 Are you seeing the GCInspector log messages ? Are they ParNew or CMS 
 compactions?
 If you want to get more insight into what the JVM is doing enable the GC 
 logging options in cassandra-env.sh. 

I see a lot of messages regarding SliceQueryFilter:

DEBUG [ReadStage:53] 2013-01-08 18:08:36,451 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(edbc633e-3f09-11e2-8f7d-e0db55018fa4 -delete 
at 1357508861622915- 
[hash:false:36@1354732265022159,is_dir:false:1@1354732265022159,mtime:false:4@1354732265022159,name:false:57@1354732265022159,revision:false:16@1354732265022159,])
DEBUG [ReadStage:62] 2013-01-08 18:08:36,467 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(75869f16-3f0d-11e2-a935-e0db550199f4 -delete 
at 1357543298946499- 
[hash:false:36@1354733781339045,is_dir:false:1@1354733781339045,mtime:false:4@1354733781339045,name:false:56@1354733781339045,revision:false:16@1354733781339045,])
DEBUG [ReadStage:64] 2013-01-08 18:08:36,449 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(6b3323de-3f0a-11e2-93b7-e0db55018fa4 -delete 
at 1357543981711099- 
[hash:false:36@1354732475524213,is_dir:false:1@1354732475524213,mtime:false:4@1354732475524213,name:false:56@1354732475524213,revision:false:16@1354732475524213,])
DEBUG [ReadStage:51] 2013-01-08 18:08:36,448 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(2e2ccb66-3f0f-11e2-9f34-e0db5501ca40 -delete 
at 1357548656930340- 
[hash:false:36@1354734520625161,is_dir:false:1@1354734520625161,mtime:false:4@1354734520625161,name:false:54@1354734520625161,revision:false:16@1354734520625161,])
DEBUG [ReadStage:62] 2013-01-08 18:08:36,468 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(758c5f3c-3f0d-11e2-a935-e0db550199f4 -delete 
at 1357543303722497- 
[hash:false:36@1354733781376479,is_dir:false:1@1354733781376479,mtime:false:4@1354733781376479,name:false:56@1354733781376479,revision:false:16@1354733781376479,])
DEBUG [ReadStage:61] 2013-01-08 18:08:36,447 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(be15520e-3f08-11e2-843b-e0db550199f4 -delete 
at 1357508704355097- 
[hash:false:36@1354731755577230,is_dir:false:1@1354731755577230,mtime:false:4@1354731755577230,name:false:57@1354731755577230,revision:false:16@1354731755577230,])
DEBUG [ReadStage:52] 2013-01-08 18:08:36,446 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(463b877e-3f0a-11e2-b990-e0db55018fa4 -delete 
at 1357543038078223- 
[hash:false:36@1354732413504338,is_dir:false:1@1354732413504338,mtime:false:4@1354732413504338,name:false:57@1354732413504338,revision:false:16@1354732413504338,])
DEBUG [ReadStage:52] 2013-01-08 18:08:36,471 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(463ef5c6-3f0a-11e2-b990-e0db55018fa4 -delete 
at 1357543038078223- 
[hash:false:36@1354732413523782,is_dir:false:1@1354732413523782,mtime:false:4@1354732413523782,name:false:57@1354732413523782,revision:false:16@1354732413523782,])


GC related messages:

 INFO [ScheduledTasks:1] 2013-01-09 12:11:17,554 GCInspector.java (line 122) GC 
for ParNew: 426 ms for 2 collections, 6138212856 used; max is 8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:11:19,819 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 324 ms for 1 collections, 6136066400 used; max is 
8357150720
 WARN [ScheduledTasks:1] 2013-01-09 12:11:19,820 GCInspector.java (line 145) 
Heap is 0.7342294767181129 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically
 WARN [ScheduledTasks:1] 2013-01-09 12:11:19,821 StorageService.java (line 
2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FilesPerBlock') to relieve 
memory pressure
 INFO [ScheduledTasks:1] 2013-01-09 12:11:19,821 ColumnFamilyStore.java (line 
659) Enqueuing flush of Memtable-FilesPerBlock@271892815(3190888/38297827 
serialized/live bytes, 24184 ops)
 INFO [FlushWriter:5] 2013-01-09 12:11:19,822 Memtable.java (line 264) Writing 
Memtable-FilesPerBlock@271892815(3190888/38297827 serialized/live bytes, 24184 
ops)
 INFO [FlushWriter:5] 2013-01-09 12:11:20,118 Memtable.java (line 305) 
Completed flushing 

Re: Query regarding SSTable timestamps and counts

2012-12-10 Thread B. Todd Burruss
my two cents ... i know this thread is a bit old, but the fact that
odd-sized SSTABLEs (usually large ones) will hang around for a while
can be very troublesome on disk space and planning.  our data is
temporal in cassandra, being deleted constantly.  we have seen space
usage in the 1+ TB range when actually there is less than 100gb of
usable data.  this is because the tombstoned data will not be deleted
until it is compacted with its tombstone.  this scenario doesn't
really follow the sizing plan of "give yourself 2x disk space due to
compaction."

our fix was to use leveled compaction which maintains very low
overhead and removes tombstoned data fairly quickly.  this is at the
cost of disk I/O, but we are fine with the I/O.



On Tue, Nov 20, 2012 at 5:18 PM, aaron morton aa...@thelastpickle.com wrote:
 upgradesstables re-writes every sstable to have the same contents in the
 newest format.

 Agree.
  In the world of compaction, and excluding upgrades, having older sstables is
 expected.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 21/11/2012, at 11:45 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com
 wrote:

 My understanding of the compaction process was that since data files keep
 continuously merging we should not have data files with very old last
 modified timestamps

 It is perfectly OK to have very old SSTables.

 But performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.

 upgradesstables re-writes every sstable to have the same contents in the
 newest format.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com
 wrote:

 Hello Aaron,

 Thanks a lot for the reply.

 Looks like the documentation is confusing. Here is the link I am referring
 to:  http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction


 It does not disable compaction.

 As per the above url,  After running a major compaction, automatic minor
 compactions are no longer triggered, frequently requiring you to manually
 run major compactions on a routine basis. ( Just before the heading Tuning
 Column Family compression in the above link)

 With respect to the replies below :


 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.

 This is for the minor compaction and major compaction should theoretically
 result in one large file irrespective of the number of data files initially?

 This is not something you have to worry about. Unless you are seeing
 1,000's of files using the default compaction.


 Well my worry has been because of the large amount of node movements we have
 done in the ring. We started off with 6 nodes and increased the capacity to
 12 with disproportionate increases every time which resulted in a lot of
 clean of data folders except system, run repair and then a cleanup with an
 aborted attempt in between.

 There were some data.db files older by more than 2 weeks and were not
 modified since then. My understanding of the compaction process was that
 since data files keep continuously merging we should not have data files
 with very old last modified timestamps (assuming there is a good amount of
 writes to the table continuously) I did not have a for sure way of telling
 if everything is alright with the compaction looking at the last modified
 timestamps of all the data.db files.

 What are the compaction issues you are having ?

 Your replies confirm that the timestamps should not be an issue to worry
 about. So I guess I should not be calling them as issues any more.  But
 performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.



 Regards,
 Ananth


 On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com
 wrote:


 As per datastax documentation, a manual compaction forces the admin to
 start compaction manually and disables the automated compaction (atleast for
 major compactions but not minor compactions )

 It does not disable compaction.
 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.


 1. Does a nodetool stop compaction also force the admin to manually run
 major compaction ( I.e. disable automated major compactions ? )

 No.
 Stop just stops the current compaction.
 Nothing is disabled.

 2. Can a node restart reset the automated major compaction if a node gets
 into a manual mode compaction for whatever reason ?

 Major compaction is not automatic. It is the manual nodetool compact
 command.
 Automatic (minor) compaction is controlled by min_compaction_threshold and
 max_compaction_threshold (for the default compaction 

How to query secondary indexes

2012-11-28 Thread Oren Karmi
Hi,

According to the documentation on Indexes (
http://www.datastax.com/docs/1.1/ddl/indexes ),
in order to use WHERE on a column which is not part of my key, I must
define a secondary index on it. However, I can only use equality comparison
on it but I wish to use other comparison methods like greater than.

Let's say I have a room with people and every timestamp, I measure
the temperature of the room and number of people. I use the timestamp as my
key and I want to select all timestamps where temperature was over 50
degrees but I can't seem to be able to do it with a regular query even if I
define that column as a secondary index.
SELECT * FROM MyTable WHERE temp > 50.4571;

My lame workaround is to define a secondary index on NumOfPeopleInRoom and
then for a specific value
SELECT * FROM MyTable WHERE NumOfPeopleInRoom = 7 AND temp > 50.4571;

I'm pretty sure this is not the proper way for me to do this.

How should I attack this? It feels like I'm missing a very basic concept.
I'd appreciate it if your answers include also the option of not changing
my schema.

Thanks!!!


Re: How to query secondary indexes

2012-11-28 Thread Blake Eggleston
You're going to have a problem doing this in a single query because you're
asking cassandra to select a non-contiguous set of rows. Also, to my
knowledge, you can only use non equal operators on clustering keys. The
best solution I could come up with would be to define your table like so:

CREATE TABLE room_data (
room_id uuid,
in_room int,
temp float,
time timestamp,
PRIMARY KEY (room_id, in_room, temp));

Then run 2 queries:
SELECT * FROM room_data WHERE in_room > 7;
SELECT * FROM room_data WHERE temp > 50.0;

And do an intersection on the results.

I should add the disclaimer that I am relatively new to CQL, so there may
be a better way to do this.

Blake
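
A hedged sketch of the kind of layout where that range predicate works directly: make the measurement a clustering column inside a known partition (here a per-day bucket; the names are made up, not the poster's schema). Equality on the partition key plus a range on the first clustering column is a slice Cassandra can serve from contiguous data.

-- hypothetical: readings bucketed by day, clustered by temperature
CREATE TABLE temp_by_day (
    day  text,       -- e.g. '2012-11-28'
    temp float,
    ts   timestamp,
    PRIMARY KEY (day, temp, ts)
);

-- all readings above 50 degrees for one day
SELECT ts, temp FROM temp_by_day WHERE day = '2012-11-28' AND temp > 50.0;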


On Wed, Nov 28, 2012 at 10:02 AM, Oren Karmi oka...@gmail.com wrote:

 Hi,

 According to the documentation on Indexes (
 http://www.datastax.com/docs/1.1/ddl/indexes ),
 in order to use WHERE on a column which is not part of my key, I must
 define a secondary index on it. However, I can only use equality comparison
 on it but I wish to use other comparison methods like greater than.

 Let's say I have a room with people and every timestamp, I measure
 the temperature of the room and number of people. I use the timestamp as my
 key and I want to select all timestamps where temperature was over 50
 degrees but I can't seem to be able to do it with a regular query even if I
 define that column as a secondary index.
 SELECT * FROM MyTable WHERE temp > 50.4571;

 My lame workaround is to define a secondary index on NumOfPeopleInRoom and
 then for a specific value
 SELECT * FROM MyTable WHERE NumOfPeopleInRoom = 7 AND temp > 50.4571;

 I'm pretty sure this is not the proper way for me to do this.

 How should I attack this? It feels like I'm missing a very basic concept.
 I'd appreciate it if your answers include also the option of not changing
 my schema.

 Thanks!!!



Re: Query regarding SSTable timestamps and counts

2012-11-20 Thread aaron morton
 My understanding of the compaction process was that since data files keep 
 continuously merging we should not have data files with very old last 
 modified timestamps 
It is perfectly OK to have very old SSTables. 

 But performing an upgradesstables did decrease the number of data files and 
 removed all the data files with the old timestamps. 
upgradesstables re-writes every sstable to have the same contents in the newest 
format. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com wrote:

 Hello Aaron,
 
 Thanks a lot for the reply. 
 
 Looks like the documentation is confusing. Here is the link I am referring 
 to:  http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction
 
 
  It does not disable compaction. 
 As per the above url,  After running a major compaction, automatic minor 
 compactions are no longer triggered, frequently requiring you to manually run 
 major compactions on a routine basis. ( Just before the heading Tuning 
 Column Family compression in the above link) 
 
 With respect to the replies below : 
 
 
  it creates one big file, which will not be compacted until there are (by 
  default) 3 other very big files. 
 This is for the minor compaction and major compaction should theoretically 
 result in one large file irrespective of the number of data files initially? 
 
 This is not something you have to worry about. Unless you are seeing 1,000's 
 of files using the default compaction.
 
 Well my worry has been because of the large amount of node movements we have 
 done in the ring. We started off with 6 nodes and increased the capacity to 
 12 with disproportionate increases every time which resulted in a lot of 
 clean of data folders except system, run repair and then a cleanup with an 
 aborted attempt in between.  
 
 There were some data.db files older by more than 2 weeks and were not 
 modified since then. My understanding of the compaction process was that 
 since data files keep continuously merging we should not have data files with 
 very old last modified timestamps (assuming there is a good amount of writes 
 to the table continuously) I did not have a for sure way of telling if 
 everything is alright with the compaction looking at the last modified 
 timestamps of all the data.db files.
 
 What are the compaction issues you are having ? 
 Your replies confirm that the timestamps should not be an issue to worry 
 about. So I guess I should not be calling them as issues any more.  But 
 performing an upgradesstables did decrease the number of data files and 
 removed all the data files with the old timestamps. 
 
 
 
 Regards,
 Ananth  
 
 
 On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com wrote:
 As per datastax documentation, a manual compaction forces the admin to start 
 compaction manually and disables the automated compaction (atleast for major 
 compactions but not minor compactions )
 It does not disable compaction. 
 it creates one big file, which will not be compacted until there are (by 
 default) 3 other very big files. 
 
 
 1. Does a nodetool stop compaction also force the admin to manually run 
 major compaction ( I.e. disable automated major compactions ? ) 
 No. 
 Stop just stops the current compaction. 
 Nothing is disabled. 
 
 2. Can a node restart reset the automated major compaction if a node gets 
 into a manual mode compaction for whatever reason ? 
 Major compaction is not automatic. It is the manual nodetool compact command. 
 Automatic (minor) compaction is controlled by min_compaction_threshold and 
 max_compaction_threshold (for the default compaction strategy).
 
 3. What is the ideal  number of SSTables for a table in a keyspace ( I mean 
 are there any indicators as to whether my compaction is alright or not ? )  
 This is not something you have to worry about. 
 Unless you are seeing 1,000's of files using the default compaction. 
 
  For example, I have seen SSTables on the disk more than 10 days old wherein 
 there were other SSTables belonging to the same table but much younger than 
 the older SSTables (
 No problems. 
 
 4. Does a upgradesstables fix any compaction issues ? 
 What are the compaction issues you are having ? 
 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com 
 wrote:
 
 
 We have a cluster  running cassandra 1.1.4. On this cluster, 
 
 1. We had to move the nodes around a bit  when we were adding new nodes 
 (there was quite a good amount of node movement ) 
 
 2. We had to stop compactions during some of the days to save some disk  
 space on some of the nodes when they were running very very low on disk 
 spaces. (via nodetool stop COMPACTION)  
 
 
 As per datastax documentation, 

Re: Query regarding SSTable timestamps and counts

2012-11-20 Thread Edward Capriolo
On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com wrote:
 My understanding of the compaction process was that since data files keep
 continuously merging we should not have data files with very old last
 modified timestamps

 It is perfectly OK to have very old SSTables.

 But performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.

 upgradesstables re-writes every sstable to have the same contents in the
 newest format.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com
 wrote:

 Hello Aaron,

 Thanks a lot for the reply.

 Looks like the documentation is confusing. Here is the link I am referring
 to:  http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction


 It does not disable compaction.
 As per the above url,  After running a major compaction, automatic minor
 compactions are no longer triggered, frequently requiring you to manually
 run major compactions on a routine basis. ( Just before the heading Tuning
 Column Family compression in the above link)

 With respect to the replies below :


 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.
 This is for the minor compaction and major compaction should theoretically
 result in one large file irrespective of the number of data files initially?

This is not something you have to worry about. Unless you are seeing
 1,000's of files using the default compaction.

 Well my worry has been because of the large amount of node movements we have
 done in the ring. We started off with 6 nodes and increased the capacity to
 12 with disproportionate increases every time which resulted in a lot of
 clean of data folders except system, run repair and then a cleanup with an
 aborted attempt in between.

 There were some data.db files older by more than 2 weeks and were not
 modified since then. My understanding of the compaction process was that
 since data files keep continuously merging we should not have data files
 with very old last modified timestamps (assuming there is a good amount of
 writes to the table continuously) I did not have a for sure way of telling
 if everything is alright with the compaction looking at the last modified
 timestamps of all the data.db files.

What are the compaction issues you are having ?
 Your replies confirm that the timestamps should not be an issue to worry
 about. So I guess I should not be calling them as issues any more.  But
 performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.



 Regards,
 Ananth


 On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com
 wrote:

 As per datastax documentation, a manual compaction forces the admin to
 start compaction manually and disables the automated compaction (atleast for
 major compactions but not minor compactions )

 It does not disable compaction.
 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.


 1. Does a nodetool stop compaction also force the admin to manually run
 major compaction ( I.e. disable automated major compactions ? )

 No.
 Stop just stops the current compaction.
 Nothing is disabled.

 2. Can a node restart reset the automated major compaction if a node gets
 into a manual mode compaction for whatever reason ?

 Major compaction is not automatic. It is the manual nodetool compact
 command.
 Automatic (minor) compaction is controlled by min_compaction_threshold and
 max_compaction_threshold (for the default compaction strategy).

 3. What is the ideal  number of SSTables for a table in a keyspace ( I
 mean are there any indicators as to whether my compaction is alright or not
 ? )

 This is not something you have to worry about.
 Unless you are seeing 1,000's of files using the default compaction.

  For example, I have seen SSTables on the disk more than 10 days old
 wherein there were other SSTables belonging to the same table but much
 younger than the older SSTables (

 No problems.

 4. Does a upgradesstables fix any compaction issues ?

 What are the compaction issues you are having ?


 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com
 wrote:


 We have a cluster  running cassandra 1.1.4. On this cluster,

 1. We had to move the nodes around a bit  when we were adding new nodes
 (there was quite a good amount of node movement )

 2. We had to stop compactions during some of the days to save some disk
 space on some of the nodes when they were running very very low on disk
 spaces. (via nodetool stop COMPACTION)


 As per datastax documentation, a manual 

Re: Query regarding SSTable timestamps and counts

2012-11-20 Thread Ananth Gundabattula
Thanks a lot Aaron and Edward.

The mail thread clarifies some things for me.

For letting others know on this thread, running an upgradesstables did
decrease our bloom filter false positive ratios a lot. ( upgradesstables
was run not to upgrade from a cassandra version to a higher cassandra
version but because of all the node movement we had done to upgrade our
cluster in a staggered way with aborted attempts in between and I
understand that upgradesstables was not necessarily required for the high
bloom filter false positive rates we were seeing )


Regards,
Ananth


On Wed, Nov 21, 2012 at 9:45 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com
 wrote:
  My understanding of the compaction process was that since data files keep
  continuously merging we should not have data files with very old last
  modified timestamps
 
  It is perfectly OK to have very old SSTables.
 
  But performing an upgradesstables did decrease the number of data files
 and
  removed all the data files with the old timestamps.
 
  upgradesstables re-writes every sstable to have the same contents in the
  newest format.
 
  Cheers
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  New Zealand
 
  @aaronmorton
  http://www.thelastpickle.com
 
  On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com
  wrote:
 
  Hello Aaron,
 
  Thanks a lot for the reply.
 
  Looks like the documentation is confusing. Here is the link I am
 referring
  to:
 http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction
 
 
  It does not disable compaction.
  As per the above url,  After running a major compaction, automatic minor
  compactions are no longer triggered, frequently requiring you to manually
  run major compactions on a routine basis. ( Just before the heading
 Tuning
  Column Family compression in the above link)
 
  With respect to the replies below :
 
 
  it creates one big file, which will not be compacted until there are (by
  default) 3 other very big files.
  This is for the minor compaction and major compaction should
 theoretically
  result in one large file irrespective of the number of data files
 initially?
 
 This is not something you have to worry about. Unless you are seeing
  1,000's of files using the default compaction.
 
  Well my worry has been because of the large amount of node movements we
 have
  done in the ring. We started off with 6 nodes and increased the capacity
 to
  12 with disproportionate increases every time which resulted in a lot of
  clean of data folders except system, run repair and then a cleanup with
 an
  aborted attempt in between.
 
  There were some data.db files older by more than 2 weeks and were not
  modified since then. My understanding of the compaction process was that
  since data files keep continuously merging we should not have data files
  with very old last modified timestamps (assuming there is a good amount
 of
  writes to the table continuously) I did not have a for sure way of
 telling
  if everything is alright with the compaction looking at the last modified
  timestamps of all the data.db files.
 
 What are the compaction issues you are having ?
  Your replies confirm that the timestamps should not be an issue to worry
  about. So I guess I should not be calling them as issues any more.  But
  performing an upgradesstables did decrease the number of data files and
  removed all the data files with the old timestamps.
 
 
 
  Regards,
  Ananth
 
 
  On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com
  wrote:
 
  As per datastax documentation, a manual compaction forces the admin to
  start compaction manually and disables the automated compaction
 (atleast for
  major compactions but not minor compactions )
 
  It does not disable compaction.
  it creates one big file, which will not be compacted until there are (by
  default) 3 other very big files.
 
 
  1. Does a nodetool stop compaction also force the admin to manually run
  major compaction ( I.e. disable automated major compactions ? )
 
  No.
  Stop just stops the current compaction.
  Nothing is disabled.
 
  2. Can a node restart reset the automated major compaction if a node
 gets
  into a manual mode compaction for whatever reason ?
 
  Major compaction is not automatic. It is the manual nodetool compact
  command.
  Automatic (minor) compaction is controlled by min_compaction_threshold
 and
  max_compaction_threshold (for the default compaction strategy).
 
  3. What is the ideal  number of SSTables for a table in a keyspace ( I
  mean are there any indicators as to whether my compaction is alright or
 not
  ? )
 
  This is not something you have to worry about.
  Unless you are seeing 1,000's of files using the default compaction.
 
   For example, I have seen SSTables on the disk more than 10 days old
  wherein there were other SSTables belonging to the same table but much
  

Re: Query regarding SSTable timestamps and counts

2012-11-20 Thread aaron morton
 upgradesstables re-writes every sstable to have the same contents in the
 newest format.
Agree. 
 In the world of compaction, and excluding upgrades, having older sstables is 
expected.

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/11/2012, at 11:45 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com wrote:
 My understanding of the compaction process was that since data files keep
 continuously merging we should not have data files with very old last
 modified timestamps
 
 It is perfectly OK to have very old SSTables.
 
 But performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.
 
 upgradesstables re-writes every sstable to have the same contents in the
 newest format.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com
 wrote:
 
 Hello Aaron,
 
 Thanks a lot for the reply.
 
 Looks like the documentation is confusing. Here is the link I am referring
 to:  http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction
 
 
 It does not disable compaction.
 As per the above url,  After running a major compaction, automatic minor
 compactions are no longer triggered, frequently requiring you to manually
 run major compactions on a routine basis. ( Just before the heading Tuning
 Column Family compression in the above link)
 
 With respect to the replies below :
 
 
 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.
 This is for the minor compaction and major compaction should theoretically
 result in one large file irrespective of the number of data files initially?
 
 This is not something you have to worry about. Unless you are seeing
 1,000's of files using the default compaction.
 
 Well my worry has been because of the large amount of node movements we have
 done in the ring. We started off with 6 nodes and increased the capacity to
 12 with disproportionate increases every time which resulted in a lot of
 clean of data folders except system, run repair and then a cleanup with an
 aborted attempt in between.
 
 There were some data.db files older by more than 2 weeks and were not
 modified since then. My understanding of the compaction process was that
 since data files keep continuously merging we should not have data files
 with very old last modified timestamps (assuming there is a good amount of
 writes to the table continuously) I did not have a for sure way of telling
 if everything is alright with the compaction looking at the last modified
 timestamps of all the data.db files.
 
 What are the compaction issues you are having ?
 Your replies confirm that the timestamps should not be an issue to worry
 about. So I guess I should not be calling them as issues any more.  But
 performing an upgradesstables did decrease the number of data files and
 removed all the data files with the old timestamps.
 
 
 
 Regards,
 Ananth
 
 
 On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com
 wrote:
 
 As per datastax documentation, a manual compaction forces the admin to
 start compaction manually and disables the automated compaction (atleast for
 major compactions but not minor compactions )
 
 It does not disable compaction.
 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.
 
 
 1. Does a nodetool stop compaction also force the admin to manually run
 major compaction ( I.e. disable automated major compactions ? )
 
 No.
 Stop just stops the current compaction.
 Nothing is disabled.
 
 2. Can a node restart reset the automated major compaction if a node gets
 into a manual mode compaction for whatever reason ?
 
 Major compaction is not automatic. It is the manual nodetool compact
 command.
 Automatic (minor) compaction is controlled by min_compaction_threshold and
 max_compaction_threshold (for the default compaction strategy).
 
 3. What is the ideal  number of SSTables for a table in a keyspace ( I
 mean are there any indicators as to whether my compaction is alright or not
 ? )
 
 This is not something you have to worry about.
 Unless you are seeing 1,000's of files using the default compaction.
 
 For example, I have seen SSTables on the disk more than 10 days old
 wherein there were other SSTables belonging to the same table but much
 younger than the older SSTables (
 
 No problems.
 
 4. Does a upgradesstables fix any compaction issues ?
 
 What are the compaction issues you are having ?
 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com

Re: Collections, query for contains?

2012-11-19 Thread Edward Capriolo
This was my first question after I got the inserts working. Hive has udfs
like array contains. It also has lateral view syntax that is similar to
transposed.

On Monday, November 19, 2012, Timmy Turner timm.t...@gmail.com wrote:
 Is there no option to query for the contents of a collection?
 Something like
   select * from cf where c_list contains('some_value')
 or
   select * from cf where c_map contains('some_key')
 or
   select * from cf where c_map['some_key'] contains('some_value')


Re: Collections, query for contains?

2012-11-19 Thread Sylvain Lebresne
It's not supported yet, no, but we have a ticket for it:
https://issues.apache.org/jira/browse/CASSANDRA-4511
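
For what it's worth, a hedged sketch of the syntax that ticket eventually introduced in later Cassandra releases (2.1 and up); the names are the ones from the question, and the collection needs a secondary index for the predicate to be accepted:

CREATE INDEX ON cf (c_list);        -- indexes the values of the list
CREATE INDEX ON cf (KEYS(c_map));   -- indexes the keys of the map

SELECT * FROM cf WHERE c_list CONTAINS 'some_value';
SELECT * FROM cf WHERE c_map CONTAINS KEY 'some_key';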


On Mon, Nov 19, 2012 at 3:56 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 This was my first question after I got the inserts working. Hive has udfs
 like array contains. It also has lateral view syntax that is similar to
 transposed.


 On Monday, November 19, 2012, Timmy Turner timm.t...@gmail.com wrote:
  Is there no option to query for the contents of a collection?
  Something like
select * from cf where c_list contains('some_value')
  or
select * from cf where c_map contains('some_key')
  or
select * from cf where c_map['some_key'] contains('some_value')



Re: Query regarding SSTable timestamps and counts

2012-11-19 Thread Rob Coli
On Sun, Nov 18, 2012 at 7:57 PM, Ananth Gundabattula
agundabatt...@gmail.com wrote:
 As per the above url,  After running a major compaction, automatic minor
 compactions are no longer triggered, frequently requiring you to manually
 run major compactions on a routine basis. ( Just before the heading Tuning
 Column Family compression in the above link)

This inaccurate statement has been questioned a few times on the
mailing list. Generally what happens is people discuss it for about 10
emails and then give up because they can't really make sense of it. If
you google for cassandra-user and that sentence above, you should find
the threads. I suggest mailing d...@datastax.com, explaining your
confusion, and asking them to fix it.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Query regarding SSTable timestamps and counts

2012-11-18 Thread aaron morton
 As per datastax documentation, a manual compaction forces the admin to start 
 compaction manually and disables the automated compaction (atleast for major 
 compactions but not minor compactions )
It does not disable compaction. 
it creates one big file, which will not be compacted until there are (by 
default) 3 other very big files. 


 1. Does a nodetool stop compaction also force the admin to manually run major 
 compaction ( I.e. disable automated major compactions ? ) 
No. 
Stop just stops the current compaction. 
Nothing is disabled. 

 2. Can a node restart reset the automated major compaction if a node gets 
 into a manual mode compaction for whatever reason ? 
Major compaction is not automatic. It is the manual nodetool compact command. 
Automatic (minor) compaction is controlled by min_compaction_threshold and 
max_compaction_threshold (for the default compaction strategy).
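
As a rough illustration (shown with the CQL3 map syntax of later releases; the table name is hypothetical, and the same knobs are also reachable at runtime via nodetool setcompactionthreshold), tuning those thresholds looks something like:

-- min_threshold is how many similarly sized sstables must accumulate
-- before a minor compaction of that size tier is triggered
ALTER TABLE my_cf
    WITH compaction = { 'class'         : 'SizeTieredCompactionStrategy',
                        'min_threshold' : 4,
                        'max_threshold' : 32 };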

 3. What is the ideal  number of SSTables for a table in a keyspace ( I mean 
 are there any indicators as to whether my compaction is alright or not ? )  
This is not something you have to worry about. 
Unless you are seeing 1,000's of files using the default compaction. 

  For example, I have seen SSTables on the disk more than 10 days old wherein 
 there were other SSTables belonging to the same table but much younger than 
 the older SSTables (
No problems. 

 4. Does a upgradesstables fix any compaction issues ? 
What are the compaction issues you are having ? 


Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com wrote:

 
 We have a cluster  running cassandra 1.1.4. On this cluster, 
 
 1. We had to move the nodes around a bit  when we were adding new nodes 
 (there was quite a good amount of node movement ) 
 
 2. We had to stop compactions during some of the days to save some disk  
 space on some of the nodes when they were running very very low on disk 
 spaces. (via nodetool stop COMPACTION)  
 
 
 As per datastax documentation, a manual compaction forces the admin to start 
 compaction manually and disables the automated compaction (atleast for major 
 compactions but not minor compactions )
 
 
 Here are the questions I have regarding compaction: 
 
 1. Does a nodetool stop compaction also force the admin to manually run major 
 compaction ( I.e. disable automated major compactions ? ) 
 
 2. Can a node restart reset the automated major compaction if a node gets 
 into a manual mode compaction for whatever reason ? 
 
 3. What is the ideal  number of SSTables for a table in a keyspace ( I mean 
 are there any indicators as to whether my compaction is alright or not ? )  . 
 For example, I have seen SSTables on the disk more than 10 days old wherein 
 there were other SSTables belonging to the same table but much younger than 
 the older SSTables ( The node movement and repair and cleanup happened 
 between the older SSTables and the new SSTables being touched/modified)
 
 4. Does a upgradesstables fix any compaction issues ? 
 
 Regards,
 Ananth



Re: Query regarding SSTable timestamps and counts

2012-11-18 Thread Ananth Gundabattula
Hello Aaron,

Thanks a lot for the reply.

Looks like the documentation is confusing. Here is the link I am referring
to:  http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction


 It does not disable compaction.
As per the above url,  After running a major compaction, automatic minor
compactions are no longer triggered, frequently requiring you to manually
run major compactions on a routine basis. ( Just before the heading Tuning
Column Family compression in the above link)

With respect to the replies below :


 it creates one big file, which will not be compacted until there are (by
default) 3 other very big files.
This is for the minor compaction and major compaction
should theoretically result in one large file irrespective of the number of
data files initially?

This is not something you have to worry about. Unless you are seeing
1,000's of files using the default compaction.

Well my worry has been because of the large amount of node movements we
have done in the ring. We started off with 6 nodes and increased the
capacity to 12 with disproportionate increases every time which resulted in
a lot of clean of data folders except system, run repair and then a cleanup
with an aborted attempt in between.

There were some data.db files older by more than 2 weeks and were not
modified since then. My understanding of the compaction process was that
since data files keep continuously merging we should not have data files
with very old last modified timestamps (assuming there is a good amount of
writes to the table continuously) I did not have a for sure way of telling
if everything is alright with the compaction looking at the last modified
timestamps of all the data.db files.

What are the compaction issues you are having ?
Your replies confirm that the timestamps should not be an issue to worry
about. So I guess I should not be calling them as issues any more.  But
performing an upgradesstables did decrease the number of data files and
removed all the data files with the old timestamps.



Regards,
Ananth


On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.comwrote:

 As per datastax documentation, a manual compaction forces the admin to
 start compaction manually and disables the automated compaction (atleast
 for major compactions but not minor compactions )

 It does not disable compaction.
 it creates one big file, which will not be compacted until there are (by
 default) 3 other very big files.


 1. Does a nodetool stop compaction also force the admin to manually run
 major compaction ( I.e. disable automated major compactions ? )

 No.
 Stop just stops the current compaction.
 Nothing is disabled.

 2. Can a node restart reset the automated major compaction if a node gets
 into a manual mode compaction for whatever reason ?

 Major compaction is not automatic. It is the manual nodetool compact
 command.
 Automatic (minor) compaction is controlled by min_compaction_threshold and
 max_compaction_threshold (for the default compaction strategy).

 3. What is the ideal  number of SSTables for a table in a keyspace ( I
 mean are there any indicators as to whether my compaction is alright or not
 ? )

 This is not something you have to worry about.
 Unless you are seeing 1,000's of files using the default compaction.

  For example, I have seen SSTables on the disk more than 10 days old
 wherein there were other SSTables belonging to the same table but much
 younger than the older SSTables (

 No problems.

 4. Does a upgradesstables fix any compaction issues ?

 What are the compaction issues you are having ?


 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com
 wrote:


 We have a cluster  running cassandra 1.1.4. On this cluster,

 1. We had to move the nodes around a bit  when we were adding new nodes
 (there was quite a good amount of node movement )

 2. We had to stop compactions during some of the days to save some disk
  space on some of the nodes when they were running very very low on disk
 spaces. (via nodetool stop COMPACTION)


 As per datastax documentation, a manual compaction forces the admin to
 start compaction manually and disables the automated compaction (atleast
 for major compactions but not minor compactions )


 Here are the questions I have regarding compaction:

 1. Does a nodetool stop compaction also force the admin to manually run
 major compaction ( I.e. disable automated major compactions ? )

 2. Can a node restart reset the automated major compaction if a node gets
 into a manual mode compaction for whatever reason ?

 3. What is the ideal  number of SSTables for a table in a keyspace ( I
 mean are there any indicators as to whether my compaction is alright or not
 ? )  . For example, I have seen SSTables on the disk more than 10 days old
 wherein there were other SSTables belonging to the 

Re: Strange delay in query

2012-11-13 Thread aaron morton
 I don't think that statement is accurate.
Which part ?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 13/11/2012, at 6:31 AM, Binh Nguyen binhn...@gmail.com wrote:

 I don't think that statement is accurate. The minor compaction is still 
 triggered for small sstables but for the big sstables it may or may not.
 By default Cassandra will wait until it finds 4 sstables of the same size to 
 trigger the compaction so if the sstables are big then it may take a while to 
 be compacted.
 If you are sure that you have a lot of tombstones and they will be deleted 
 then I think you are safe to go.
 
 -Binh
 
 On Sun, Nov 11, 2012 at 1:51 AM, André Cruz andre.c...@co.sapo.pt wrote:
 On Nov 11, 2012, at 12:01 AM, Binh Nguyen binhn...@gmail.com wrote:
 
 FYI: Repair does not remove tombstones. To remove tombstones you need to run 
 compaction.
 If you have a lot of data then make sure you run compaction on all nodes 
 before running repair. We had a big trouble with our system regarding 
 tombstone and it took us long time to figure out the reason. It turned out 
 that repair process also transfers TTLed data (compaction is not triggered 
 yet) to the other nodes even that data was removed from the other nodes in 
 the compaction phase before that.
 
 
 Aren't compactions triggered automatically? At least minor compactions. Also, 
 I read this in 
 http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction :
 
  After running a major compaction, automatic minor compactions are no longer 
 triggered, frequently requiring you to manually run major compactions on a 
 routine basis.
 DataStax does not recommend major compaction.
 
 So I'm unsure whether to start triggering manually these compactions… I guess 
 I'll have to experiment with it.
 
 Thanks!
 
 André
 



Re: Strange delay in query

2012-11-13 Thread André Cruz
On Nov 13, 2012, at 8:54 AM, aaron morton aa...@thelastpickle.com wrote:

 I don't think that statement is accurate.
 Which part ?

Probably this part:
After running a major compaction, automatic minor compactions are no longer 
triggered, frequently requiring you to manually run major compactions on a 
routine basis.

From what I read what happens is that it takes a lot longer for minor 
compactions to be triggered because 3 more files with the size equal to the 
compacted one have to be created?

André

Re: Strange delay in query

2012-11-13 Thread J. D. Jordan
Correct

On Nov 13, 2012, at 5:21 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 13, 2012, at 8:54 AM, aaron morton aa...@thelastpickle.com wrote:
 
 I don't think that statement is accurate.
 Which part ?
 
 Probably this part:
 After running a major compaction, automatic minor compactions are no longer 
 triggered, frequently requiring you to manually run major compactions on a 
 routine basis.
 
 From what I read what happens is that it takes a lot longer for minor 
 compactions to be triggered because 3 more files with the size equal to the 
 compacted one have to be created?
 
 André


Re: Strange delay in query

2012-11-13 Thread aaron morton
Minor compactions will still be triggered whenever a size tier gets 4+ sstables 
(for the default compaction strategy). So it does not affect new data. 

It just takes longer for the biggest size tier to get to 4 files. So it takes 
longer to compact the big output from the major compaction. 

Assuming your data roughly follows a generational model, where newer data is 
written to often and older data is mostly read from, this can mean garbage 
hanging around in the big old file and *potentially* slowing things down. 
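
For illustration, here is a toy sketch in Python (not Cassandra's actual code) 
of the size-tier bucketing idea, assuming the default min_compaction_threshold 
of 4 and that sstables within roughly 50% of a bucket's average size share a 
tier:

def eligible_tiers(sstable_sizes, bucket_ratio=0.5, min_threshold=4):
    # Group sstables into size tiers (toy model of SizeTieredCompactionStrategy).
    tiers = []
    for size in sorted(sstable_sizes):
        for tier in tiers:
            avg = sum(tier) / float(len(tier))
            if avg * (1 - bucket_ratio) <= size <= avg * (1 + bucket_ratio):
                tier.append(size)
                break
        else:
            tiers.append([size])
    # A tier only becomes eligible for a minor compaction once it holds
    # at least min_threshold sstables.
    return [t for t in tiers if len(t) >= min_threshold]

# The lone big file left by a major compaction sits in its own tier until three
# more similarly sized files appear, so it gets compacted again much later.
print(eligible_tiers([10, 12, 11, 9, 5000]))   # -> [[9, 10, 11, 12]]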

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/11/2012, at 12:21 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 13, 2012, at 8:54 AM, aaron morton aa...@thelastpickle.com wrote:
 
 I don't think that statement is accurate.
 Which part ?
 
 Probably this part:
 After running a major compaction, automatic minor compactions are no longer 
 triggered, frequently requiring you to manually run major compactions on a 
 routine basis.
 
 From what I read what happens is that it takes a lot longer for minor 
 compactions to be triggered because 3 more files with the size equal to the 
 compacted one have to be created?
 
 André



Re: Strange delay in query

2012-11-12 Thread Binh Nguyen
I don't think that statement is accurate. The minor compaction is still
triggered for small sstables, but for the big sstables it may or may not be.
By default Cassandra will wait until it finds 4 sstables of the same size
to trigger a compaction, so if the sstables are big then it may take a
while for them to be compacted.
If you are sure that you have a lot of tombstones and that they will be
deleted, then I think you are safe to go.

-Binh

On Sun, Nov 11, 2012 at 1:51 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 11, 2012, at 12:01 AM, Binh Nguyen binhn...@gmail.com wrote:

 FYI: Repair does not remove tombstones. To remove tombstones you need to
 run compaction.
 If you have a lot of data then make sure you run compaction on all nodes
 before running repair. We had a big trouble with our system regarding
 tombstone and it took us long time to figure out the reason. It turned out
 that repair process also transfers TTLed data (compaction is not triggered
 yet) to the other nodes even that data was removed from the other nodes in
 the compaction phase before that.


 Aren't compactions triggered automatically? At least minor compactions.
 Also, I read this in
 http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction :

  After running a major compaction, automatic minor compactions are no
 longer triggered, frequently requiring you to manually run major
 compactions on a routine basis.
 DataStax does *not* recommend major compaction.

 So I'm unsure whether to start triggering manually these compactions… I
 guess I'll have to experiment with it.

 Thanks!

 André



Re: Strange delay in query

2012-11-11 Thread André Cruz
On Nov 11, 2012, at 12:01 AM, Binh Nguyen binhn...@gmail.com wrote:

 FYI: Repair does not remove tombstones. To remove tombstones you need to run 
 compaction.
 If you have a lot of data then make sure you run compaction on all nodes 
 before running repair. We had a big trouble with our system regarding 
 tombstone and it took us long time to figure out the reason. It turned out 
 that repair process also transfers TTLed data (compaction is not triggered 
 yet) to the other nodes even that data was removed from the other nodes in 
 the compaction phase before that.
 

Aren't compactions triggered automatically? At least minor compactions. Also, I 
read this in 
http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction :

After running a major compaction, automatic minor compactions are no longer 
triggered, frequently requiring you to manually run major compactions on a 
routine basis.
DataStax does not recommend major compaction.

So I'm unsure whether to start manually triggering these compactions… I guess 
I'll have to experiment with it.

Thanks!

André

Re: Strange delay in query

2012-11-11 Thread aaron morton
If you have a long lived row with a lot of tombstones or overwrites, it's often 
more efficient to select a known list of columns. There are short circuits in 
the read path that can avoid older, tombstone-filled fragments of the row being 
read. (Obviously this is hard to do if you don't know the names of the columns.)
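
For example, with pycassa (the keyspace, column family, row key and column 
names below are hypothetical placeholders, just to show the shape of the call):

import pycassa

# Sketch only: fetch a known list of columns instead of slicing the row.
pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

# Naming the columns lets the read path skip tombstone-filled fragments that a
# plain slice (column_start/column_count) would otherwise have to walk past.
row = cf.get('some-row-key', columns=['name', 'size', 'is_deleted'])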

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/11/2012, at 10:51 PM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 11, 2012, at 12:01 AM, Binh Nguyen binhn...@gmail.com wrote:
 
 FYI: Repair does not remove tombstones. To remove tombstones you need to run 
 compaction.
 If you have a lot of data then make sure you run compaction on all nodes 
 before running repair. We had a big trouble with our system regarding 
 tombstone and it took us long time to figure out the reason. It turned out 
 that repair process also transfers TTLed data (compaction is not triggered 
 yet) to the other nodes even that data was removed from the other nodes in 
 the compaction phase before that.
 
 
 Aren't compactions triggered automatically? At least minor compactions. Also, 
 I read this in 
 http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction :
 
 After running a major compaction, automatic minor compactions are no longer 
 triggered, frequently requiring you to manually run major compactions on a 
 routine basis.
 DataStax does not recommend major compaction.
 
 So I'm unsure whether to start triggering manually these compactions… I guess 
 I'll have to experiment with it.
 
 Thanks!
 
 André



Re: Strange delay in query

2012-11-10 Thread Binh Nguyen
FYI: Repair does not remove tombstones. To remove tombstones you need to
run compaction.
If you have a lot of data then make sure you run compaction on all nodes
before running repair. We had big trouble with our system regarding
tombstones and it took us a long time to figure out the reason. It turned
out that the repair process also transfers TTLed data (if compaction has not
been triggered yet) to the other nodes, even if that data was already removed
from the other nodes in an earlier compaction phase.

-Binh

On Fri, Nov 9, 2012 at 1:34 PM, André Cruz andre.c...@co.sapo.pt wrote:

 That must be it. I dumped the sstables to json and there are lots of
 records, including ones that are returned to my application, that have the
 deletedAt attribute. I think this is because the regular repair job was not
 running for some time, surely more than the grace period, and lots of
 tombstones stayed behind even though we are running repair regularly now.

 Thanks!
 André

 On Nov 8, 2012, at 10:51 PM, Josep Blanquer blanq...@rightscale.com
 wrote:

 Can it be that you have tons and tons of tombstoned columns in the middle
 of these two? I've seen plenty of performance issues with wide
 rows littered with column tombstones (you could check with dumping the
 sstables...)

 Just a thought...

 Josep M.

 On Thu, Nov 8, 2012 at 12:23 PM, André Cruz andre.c...@co.sapo.pt wrote:

 These are the two columns in question:

 = (super_column=13957152-234b-11e2-92bc-e0db550199f4,
  (column=attributes, value=, timestamp=1351681613263657)
  (column=blocks,
 value=A4edo5MhHvojv3Ihx_JkFMsF3ypthtBvAZkoRHsjulw06pez86OHch3K3OpmISnDjHODPoCf69bKcuAZSJj-4Q,
 timestamp=1351681613263657)
  (column=hash,
 value=8_p2QaeRaX_QwJbUWQ07ZqlNHei7ixu0MHxgu9oennfYOGfyH6EsEe_LYO8V8EC_1NPL44Gx8B7UhYV9VSb7Lg,
 timestamp=1351681613263657)
  (column=icon, value=image_jpg, timestamp=1351681613263657)
  (column=is_deleted, value=true, timestamp=1351681613263657)
  (column=is_dir, value=false, timestamp=1351681613263657)
  (column=mime_type, value=image/jpeg, timestamp=1351681613263657)
  (column=mtime, value=1351646803, timestamp=1351681613263657)
  (column=name, value=/Mobile Photos/Photo 2012-10-28 17_13_50.jpeg,
 timestamp=1351681613263657)
  (column=revision, value=13957152-234b-11e2-92bc-e0db550199f4,
 timestamp=1351681613263657)
  (column=size, value=1379001, timestamp=1351681613263657)
  (column=thumb_exists, value=true, timestamp=1351681613263657))
 = (super_column=40b7ae4e-2449-11e2-8610-e0db550199f4,
  (column=attributes, value={posix: 420}, timestamp=1351790781154800)
  (column=blocks,
 value=9UCDkHNb8-8LuKr2bv9PjKcWCT0v7FCZa0ebNSflES4-o7QD6eYschVaweCKSbR29Dq2IeGl_Cu7BVnYJYphTQ,
 timestamp=1351790781154800)
  (column=hash,
 value=kao2EV8jw_wN4EBoMkCXZWCwg3qQ0X6m9_X9JIGkEkiGKJE_JeKgkdoTAkAefXgGtyhChuhWPlWMxl_tX7VZUw,
 timestamp=1351790781154800)
  (column=icon, value=text_txt, timestamp=1351790781154800)
  (column=is_dir, value=false, timestamp=1351790781154800)
  (column=mime_type, value=text/plain, timestamp=1351790781154800)
  (column=mtime, value=1351378576, timestamp=1351790781154800)
  (column=name, value=/Documents/VIMDocument.txt,
 timestamp=1351790781154800)
  (column=revision, value=40b7ae4e-2449-11e2-8610-e0db550199f4,
 timestamp=1351790781154800)
  (column=size, value=13, timestamp=1351790781154800)
  (column=thumb_exists, value=false, timestamp=1351790781154800))


 I don't think their size is an issue here.

 André

 On Nov 8, 2012, at 6:04 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 What is the size of columns? Probably those two are huge.


 On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

  This error also happens on my application that uses pycassa, so I
 don't think this is the same bug.

 I have narrowed it down to a slice between two consecutive columns.
 Observe this behaviour using pycassa:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked in to pool
 51715344
 [UUID('13957152-234b-11e2-92bc-e0db550199f4'),
 UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

 A two column slice took more than 2s to return. If I request the next 2
 column slice:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849
 139928791262976 Connection 52904912 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 

Re: Strange delay in query

2012-11-09 Thread André Cruz
That must be it. I dumped the sstables to json and there are lots of records, 
including ones that are returned to my application, that have the deletedAt 
attribute. I think this is because the regular repair job was not running for 
some time, surely more than the grace period, and lots of tombstones stayed 
behind even though we are running repair regularly now.

Thanks!
André

On Nov 8, 2012, at 10:51 PM, Josep Blanquer blanq...@rightscale.com wrote:

 Can it be that you have tons and tons of tombstoned columns in the middle of 
 these two? I've seen plenty of performance issues with wide rows littered 
 with column tombstones (you could check with dumping the sstables...)
 
 Just a thought...
 
 Josep M.
 
 On Thu, Nov 8, 2012 at 12:23 PM, André Cruz andre.c...@co.sapo.pt wrote:
 These are the two columns in question:
 
 = (super_column=13957152-234b-11e2-92bc-e0db550199f4,
  (column=attributes, value=, timestamp=1351681613263657)
  (column=blocks, 
 value=A4edo5MhHvojv3Ihx_JkFMsF3ypthtBvAZkoRHsjulw06pez86OHch3K3OpmISnDjHODPoCf69bKcuAZSJj-4Q,
  timestamp=1351681613263657)
  (column=hash, 
 value=8_p2QaeRaX_QwJbUWQ07ZqlNHei7ixu0MHxgu9oennfYOGfyH6EsEe_LYO8V8EC_1NPL44Gx8B7UhYV9VSb7Lg,
  timestamp=1351681613263657)
  (column=icon, value=image_jpg, timestamp=1351681613263657)
  (column=is_deleted, value=true, timestamp=1351681613263657)
  (column=is_dir, value=false, timestamp=1351681613263657)
  (column=mime_type, value=image/jpeg, timestamp=1351681613263657)
  (column=mtime, value=1351646803, timestamp=1351681613263657)
  (column=name, value=/Mobile Photos/Photo 2012-10-28 17_13_50.jpeg, 
 timestamp=1351681613263657)
  (column=revision, value=13957152-234b-11e2-92bc-e0db550199f4, 
 timestamp=1351681613263657)
  (column=size, value=1379001, timestamp=1351681613263657)
  (column=thumb_exists, value=true, timestamp=1351681613263657))
 = (super_column=40b7ae4e-2449-11e2-8610-e0db550199f4,
  (column=attributes, value={posix: 420}, timestamp=1351790781154800)
  (column=blocks, 
 value=9UCDkHNb8-8LuKr2bv9PjKcWCT0v7FCZa0ebNSflES4-o7QD6eYschVaweCKSbR29Dq2IeGl_Cu7BVnYJYphTQ,
  timestamp=1351790781154800)
  (column=hash, 
 value=kao2EV8jw_wN4EBoMkCXZWCwg3qQ0X6m9_X9JIGkEkiGKJE_JeKgkdoTAkAefXgGtyhChuhWPlWMxl_tX7VZUw,
  timestamp=1351790781154800)
  (column=icon, value=text_txt, timestamp=1351790781154800)
  (column=is_dir, value=false, timestamp=1351790781154800)
  (column=mime_type, value=text/plain, timestamp=1351790781154800)
  (column=mtime, value=1351378576, timestamp=1351790781154800)
  (column=name, value=/Documents/VIMDocument.txt, 
 timestamp=1351790781154800)
  (column=revision, value=40b7ae4e-2449-11e2-8610-e0db550199f4, 
 timestamp=1351790781154800)
  (column=size, value=13, timestamp=1351790781154800)
  (column=thumb_exists, value=false, timestamp=1351790781154800))
 
 
 I don't think their size is an issue here.
 
 André
 
 On Nov 8, 2012, at 6:04 PM, Andrey Ilinykh ailin...@gmail.com wrote:
 
 What is the size of columns? Probably those two are huge.
 
 
 On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:
 On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:
 
  This error also happens on my application that uses pycassa, so I don't 
  think this is the same bug.
 
 I have narrowed it down to a slice between two consecutive columns. Observe 
 this behaviour using pycassa:
 
  DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
   column_count=2, 
  column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849 139928791262976 
 Connection 52905488 (xxx:9160) was checked out from pool 51715344
 DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849 139928791262976 
 Connection 52905488 (xxx:9160) was checked in to pool 51715344
 [UUID('13957152-234b-11e2-92bc-e0db550199f4'), 
 UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]
 
 A two column slice took more than 2s to return. If I request the next 2 
 column slice:
 
  DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
   column_count=2, 
  column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849 139928791262976 
 Connection 52904912 (xxx:9160) was checked out from pool 51715344
 DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849 139928791262976 
 Connection 52904912 (xxx:9160) was checked in to pool 51715344
 [UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'), 
 UUID('a364b028-2449-11e2-8882-e0db550199f4')]
 
 This takes 20msec... Is there a rational explanation for this different 
 behaviour? Is there some threshold that I'm running into? Is there any way 
 to obtain more debugging information about this problem?
 
 Thanks,
 André
 
 
 



Re: Strange delay in query

2012-11-08 Thread André Cruz
On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

 This error also happens on my application that uses pycassa, so I don't think 
 this is the same bug.

I have narrowed it down to a slice between two consecutive columns. Observe 
this behaviour using pycassa:

 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
  column_count=2, 
 column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849 139928791262976 
Connection 52905488 (xxx:9160) was checked out from pool 51715344
DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849 139928791262976 
Connection 52905488 (xxx:9160) was checked in to pool 51715344
[UUID('13957152-234b-11e2-92bc-e0db550199f4'), 
UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

A two column slice took more than 2s to return. If I request the next 2 column 
slice:

 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
  column_count=2, 
 column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849 139928791262976 
Connection 52904912 (xxx:9160) was checked out from pool 51715344
DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849 139928791262976 
Connection 52904912 (xxx:9160) was checked in to pool 51715344
[UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'), 
UUID('a364b028-2449-11e2-8882-e0db550199f4')]

This takes 20msec... Is there a rational explanation for this different 
behaviour? Is there some threshold that I'm running into? Is there any way to 
obtain more debugging information about this problem?

Thanks,
André

Re: Strange delay in query

2012-11-08 Thread Andrey Ilinykh
What is the size of columns? Probably those two are huge.


On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

  This error also happens on my application that uses pycassa, so I don't
 think this is the same bug.

 I have narrowed it down to a slice between two consecutive columns.
 Observe this behaviour using pycassa:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849 139928791262976
 Connection 52905488 (xxx:9160) was checked out from pool 51715344
 DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849 139928791262976
 Connection 52905488 (xxx:9160) was checked in to pool 51715344
 [UUID('13957152-234b-11e2-92bc-e0db550199f4'),
 UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

 A two column slice took more than 2s to return. If I request the next 2
 column slice:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849 139928791262976
 Connection 52904912 (xxx:9160) was checked out from pool 51715344
 DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849 139928791262976
 Connection 52904912 (xxx:9160) was checked in to pool 51715344
 [UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'),
 UUID('a364b028-2449-11e2-8882-e0db550199f4')]

 This takes 20msec... Is there a rational explanation for this different
 behaviour? Is there some threshold that I'm running into? Is there any way
 to obtain more debugging information about this problem?

 Thanks,
 André


Re: Strange delay in query

2012-11-08 Thread Josep Blanquer
Can it be that you have tons and tons of tombstoned columns in the middle
of these two? I've seen plenty of performance issues with wide
rows littered with column tombstones (you could check with dumping the
sstables...)
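
A rough way to do that check from Python (sketch only; it assumes the
Cassandra 1.1 sstable2json tool is on the PATH and that deleted columns carry
a "d" flag in its JSON output; the sstable path is a placeholder):

import subprocess

def rough_tombstone_count(sstable_data_file):
    # Dump the sstable to JSON and do a crude count of column deletion flags.
    out = subprocess.check_output(['sstable2json', sstable_data_file])
    return out.count(b'"d"')

print(rough_tombstone_count('/path/to/NamespaceRevision-hd-123-Data.db'))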

Just a thought...

Josep M.

On Thu, Nov 8, 2012 at 12:23 PM, André Cruz andre.c...@co.sapo.pt wrote:

 These are the two columns in question:

 = (super_column=13957152-234b-11e2-92bc-e0db550199f4,
  (column=attributes, value=, timestamp=1351681613263657)
  (column=blocks,
 value=A4edo5MhHvojv3Ihx_JkFMsF3ypthtBvAZkoRHsjulw06pez86OHch3K3OpmISnDjHODPoCf69bKcuAZSJj-4Q,
 timestamp=1351681613263657)
  (column=hash,
 value=8_p2QaeRaX_QwJbUWQ07ZqlNHei7ixu0MHxgu9oennfYOGfyH6EsEe_LYO8V8EC_1NPL44Gx8B7UhYV9VSb7Lg,
 timestamp=1351681613263657)
  (column=icon, value=image_jpg, timestamp=1351681613263657)
  (column=is_deleted, value=true, timestamp=1351681613263657)
  (column=is_dir, value=false, timestamp=1351681613263657)
  (column=mime_type, value=image/jpeg, timestamp=1351681613263657)
  (column=mtime, value=1351646803, timestamp=1351681613263657)
  (column=name, value=/Mobile Photos/Photo 2012-10-28 17_13_50.jpeg,
 timestamp=1351681613263657)
  (column=revision, value=13957152-234b-11e2-92bc-e0db550199f4,
 timestamp=1351681613263657)
  (column=size, value=1379001, timestamp=1351681613263657)
  (column=thumb_exists, value=true, timestamp=1351681613263657))
 = (super_column=40b7ae4e-2449-11e2-8610-e0db550199f4,
  (column=attributes, value={posix: 420}, timestamp=1351790781154800)
  (column=blocks,
 value=9UCDkHNb8-8LuKr2bv9PjKcWCT0v7FCZa0ebNSflES4-o7QD6eYschVaweCKSbR29Dq2IeGl_Cu7BVnYJYphTQ,
 timestamp=1351790781154800)
  (column=hash,
 value=kao2EV8jw_wN4EBoMkCXZWCwg3qQ0X6m9_X9JIGkEkiGKJE_JeKgkdoTAkAefXgGtyhChuhWPlWMxl_tX7VZUw,
 timestamp=1351790781154800)
  (column=icon, value=text_txt, timestamp=1351790781154800)
  (column=is_dir, value=false, timestamp=1351790781154800)
  (column=mime_type, value=text/plain, timestamp=1351790781154800)
  (column=mtime, value=1351378576, timestamp=1351790781154800)
  (column=name, value=/Documents/VIMDocument.txt,
 timestamp=1351790781154800)
  (column=revision, value=40b7ae4e-2449-11e2-8610-e0db550199f4,
 timestamp=1351790781154800)
  (column=size, value=13, timestamp=1351790781154800)
  (column=thumb_exists, value=false, timestamp=1351790781154800))


 I don't think their size is an issue here.

 André

 On Nov 8, 2012, at 6:04 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 What is the size of columns? Probably those two are huge.


 On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

  This error also happens on my application that uses pycassa, so I don't
 think this is the same bug.

 I have narrowed it down to a slice between two consecutive columns.
 Observe this behaviour using pycassa:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked in to pool
 51715344
 [UUID('13957152-234b-11e2-92bc-e0db550199f4'),
 UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

 A two column slice took more than 2s to return. If I request the next 2
 column slice:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849
 139928791262976 Connection 52904912 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849
 139928791262976 Connection 52904912 (xxx:9160) was checked in to pool
 51715344
 [UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'),
 UUID('a364b028-2449-11e2-8882-e0db550199f4')]

 This takes 20msec... Is there a rational explanation for this different
 behaviour? Is there some threshold that I'm running into? Is there any way
 to obtain more debugging information about this problem?

 Thanks,
 André






Re: Strange delay in query

2012-11-07 Thread André Cruz
On Nov 7, 2012, at 2:12 AM, Chuan-Heng Hsiao hsiao.chuanh...@gmail.com wrote:

 I assume you are using cassandra-cli and connecting to some specific node.
 
 You can check the following steps:
 
 1. Can you still reproduce this issue? (not - maybe the system/node issue)

Yes. I can reproduce this issue on all 3 nodes. Also, I have a replication 
factor of 3.


 2. What's the result when query without limit?


This row has 600k columns. I issued a count, and after some 10s:

[disco@Disco] count NamespaceRevision[3cd88d97-ffde-44ca-8ae9-5336caaebc4e];
609054 columns


 3. What's the result after doing nodetool repair -pr on that
 particular column family and that node?

I already issued a nodetool repair on all nodes, nothing changed. Would your 
command be any different?


 btw, there seems to be some minor bug in the 1.1.5 cassandra-cli (but
 not in 1.1.6).

This error also happens on my application that uses pycassa, so I don't think 
this is the same bug.


Thanks for the help!

André

Re: Strange delay in query

2012-11-06 Thread Chuan-Heng Hsiao
Hi Andre,

I am just a Cassandra user, so the following suggestions may not be valid.

I assume you are using cassandra-cli and connecting to some specific node.

You can check the following steps:

1. Can you still reproduce this issue? (If not, it may have been a
system/node issue.)
2. What's the result when you query without a limit?
3. What's the result after doing nodetool repair -pr on that
particular column family and that node?

btw, there seems to be some minor bug in the 1.1.5 cassandra-cli (but
not in 1.1.6).
I got an error message after creating an empty keyspace and updating the
replication factor to 3 across 3-4 nodes,
but when I showed the schema again, the result was correct (including
the replication factor).

Sincerely,
Hsiao


On Wed, Nov 7, 2012 at 8:34 AM, André Cruz andre.c...@co.sapo.pt wrote:
 Can anyone shed some light on this matter, please? I don't want to just 
 increase the timeout without understanding why this is happening. Some 
 pointer for me to investigate would be helpful.

 I'm running Cassandra 1.1.5 and these are wide rows (lots of small columns). 
 I would think that fetching the first 34 columns would be fast, and just a 
 little bit slower than 33 columns, but this is a big difference.

 Thank you and best regards,
 André Cruz

 On Nov 6, 2012, at 2:43 PM, André Cruz andre.c...@co.sapo.pt wrote:

 Hello.

 I have a SCF that is acting strange. See these 2 query times:


 get NamespaceRevision[3cd88d97-ffde-44ca-8ae9-5336caaebc4e] limit 33;
 ...
 Returned 33 results.
 Elapsed time: 41 msec(s).

 get NamespaceRevision[3cd88d97-ffde-44ca-8ae9-5336caaebc4e] limit 34;
 ...
 Returned 34 results.
 Elapsed time: 3569 msec(s).


 What can be the cause of this delay? I have a 3 node cluster with a 
 replication factor of 3, so all of the nodes should have a copy of the data.

 - describe cluster;
 Cluster Information:
   Snitch: org.apache.cassandra.locator.PropertyFileSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
   a354e01a-d342-3755-9821-c550dcd1caba: [zzz, yyy, xxx]


 Is there more information that I can provide?

 Best regards,
 André Cruz



Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Is this query the equivalent of a full table scan?  Without a starting
 point get_range_slice is just starting at token 0?


It is, but that's what you asked for after all. If you want to start at a
given token you can do:
  SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) >
'whatevertokenyouwant'
You can even do:
  SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) >
token(99051fe9-6a9c-46c2-b949-38ef78858dd0)
if that's simpler for you than computing the token manually. Though that is
mostly for random partitioners. For ordered ones, you can do without the
token() part.
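
If you need to compute such a token client-side, e.g. to resume a scan from
the last row you saw, here is a rough sketch (assuming RandomPartitioner,
whose token is, as far as I know, the MD5 digest of the raw key bytes read as
a signed 128-bit integer and then abs()'d):

import hashlib
import uuid

def random_partitioner_token(key_bytes):
    # MD5 of the key, interpreted as a signed 128-bit big-endian integer, abs()'d.
    value = int(hashlib.md5(key_bytes).hexdigest(), 16)
    if value >= 1 << 127:
        value -= 1 << 128
    return abs(value)

last_key = uuid.UUID('99051fe9-6a9c-46c2-b949-38ef78858dd0').bytes
print(random_partitioner_token(last_key))  # plug this into token(videoid) > ...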

--
Sylvain


Re: How does Cassandra optimize this query?

2012-11-05 Thread Edward Capriolo
I see. It is fairly misleading because it is a query that does not
work at scale. This syntax is only helpful if you have less than a few
thousand rows in Cassandra.

On Mon, Nov 5, 2012 at 12:24 PM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Is this query the equivalent of a full table scan?  Without a starting
 point get_range_slice is just starting at token 0?


 It is, but that's what you asked for after all. If you want to start at a
 given token you can do:
   SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) >
 'whatevertokenyouwant'
 You can even do:
   SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) >
 token(99051fe9-6a9c-46c2-b949-38ef78858dd0)
 if that's simpler for you than computing the token manually. Though that is
 mostly for random partitioners. For ordered ones, you can do without the
 token() part.

 --
 Sylvain


Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I see. It is fairly misleading because it is a query that does not
 work at scale. This syntax is only helpful if you have less then a few
 thousand rows in Cassandra.


Just for the sake of argument, how is that misleading? If you have billions
of rows and do the select statement from your initial mail, what did the
syntax lead you to believe it would return?

A remark like maybe we just shouldn't allow that and leave that to the
map-reduce side would make sense, but I don't see how this is misleading.

But again, this translates directly to a get_range_slice (which doesn't scale
if you have billions of rows and don't limit the output either), so there is
nothing new here.
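
For what it's worth, the same shape is visible from a client; a pycassa sketch
(the keyspace, hosts and row_count value are placeholders):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
videos = pycassa.ColumnFamily(pool, 'videos')

# get_range pages through rows in token order; without a row_count cap (or a
# token to resume from) this walks the whole column family.
for key, cols in videos.get_range(row_count=100):
    pass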


Re: How does Cassandra optimize this query?

2012-11-05 Thread Edward Capriolo
 A remark like maybe we just shouldn't allow that and leave that to the
 map-reduce side would make sense, but I don't see how this is misleading.

Yes. Bingo.

It is misleading because it is not useful in any other context besides
someone playing around with a ten-row table in cqlsh. CQL stops me
from executing some queries that are not efficient, yet it allows this
one. If I am new to Cassandra and developing, this query works and
produces a result, but once my database gets real data it produces a
different result (likely an empty one).

When I first saw this query two things came to my mind.

1) CQL (and Cassandra) must be somehow indexing all the fields of a
primary key to make this search optimal.

2) This is impossible; CQL must be gathering the first hundred random
rows and finding this thing.

What is happening is case #2. In a nutshell, CQL is just sampling
some data and running the query on it. We could support all types of
query constructs if we just take the first 100 rows and apply this
logic to them, but these things are not helpful for anything but light
ad-hoc data exploration.

My suggestions:
1) force people to supply a LIMIT clause on any query that is going to
page over get_range_slice
2) have some type of explain support so I can establish whether this
query will work in the

I say this because as an end user I do not understand if a given query
is actually going to return the same results with different data.

On Mon, Nov 5, 2012 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 I see. It is fairly misleading because it is a query that does not
 work at scale. This syntax is only helpful if you have less then a few
 thousand rows in Cassandra.


 Just for the sake of argument, how is that misleading? If you have billions
 of rows and do the select statement from you initial mail, what did the
 syntax lead you to believe it would return?

 A remark like maybe we just shouldn't allow that and leave that to the
 map-reduce side would make sense, but I don't see how this is misleading.

 But again, this translate directly to a get_range_slice (that don't scale if
 you have billion of rows and don't limit the output either) so there is
 nothing new here.


Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
Ok, I slightly misunderstood your initial complaint, my bad. I largely agree
with you, though I'm more conflicted on what the right resolution is. But
I'll follow up on the ticket to avoid repetition.


On Mon, Nov 5, 2012 at 10:42 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I created https://issues.apache.org/jira/browse/CASSANDRA-4915

 On Mon, Nov 5, 2012 at 3:27 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
  A remark like maybe we just shouldn't allow that and leave that to the
  map-reduce side would make sense, but I don't see how this is
 misleading.
 
  Yes. Bingo.
 
  It is misleading because it is not useful in any other context besides
  someone playing around with a ten row table in cqlsh. CQL stops me
  from executing some queries that are not efficient, yet it allows this
  one. If I am new to Cassandra and developing, this query works and
  produces a result then once my database gets real data produces a
  different result (likely an empty one).
 
  When I first saw this query two things came to my mind.
 
  1) CQL (and Cassandra) must be somehow indexing all the fields of a
  primary key to make this search optimal.
 
  2) This is impossible CQL must be gathering the first hundred random
  rows and finding this thing.
 
  What it is happening is case #2. In a nutshell CQL is just sampling
  some data and running the query on it. We could support all types of
  query constructs if we just take the first 100 rows and apply this
  logic to it, but these things are not helpful for anything but light
  ad-hoc data exploration.
 
  My suggestions:
  1) force people to supply a LIMIT clause on any query that is going to
  page over get_range_slice
  2) having some type of explain support so I can establish if this
  query will work in the
 
  I say this because as an end user I do not understand if a given query
  is actually going to return the same results with different data.
 
  On Mon, Nov 5, 2012 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com
 wrote:
 
  On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
  I see. It is fairly misleading because it is a query that does not
  work at scale. This syntax is only helpful if you have less then a few
  thousand rows in Cassandra.
 
 
  Just for the sake of argument, how is that misleading? If you have
 billions
  of rows and do the select statement from you initial mail, what did the
  syntax lead you to believe it would return?
 
  A remark like maybe we just shouldn't allow that and leave that to the
  map-reduce side would make sense, but I don't see how this is
 misleading.
 
  But again, this translate directly to a get_range_slice (that don't
 scale if
  you have billion of rows and don't limit the output either) so there is
  nothing new here.



How does Cassandra optimize this query?

2012-11-04 Thread Edward Capriolo
If we create a column family:

CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid,videoname)
);

The CLI views this column like so:

create column family videos
  with column_type = 'Standard'
  and comparator =
'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UUIDType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and compression_options = {'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};

[default@videos] list videos;
Using default limit of 100
Using default column limit of 100
---
RowKey: b3a76c6b-7c7f-4af6-964f-803a9283c401
= (column=Now my dog plays piano!:description, value=My dog learned to play 
the piano because of the cat., timestamp=135205828907)
= (column=Now my dog plays piano!:tags, value=dogs,piano,lol, 
timestamp=1352058289070001)
invalid UTF8 bytes 0139794c30c0

SELECT * FROM videos WHERE videoname = 'My funny cat';

 videoid                              | videoname    | description                                | tags           | upload_date          | username
--------------------------------------+--------------+--------------------------------------------+----------------+----------------------+----------
 99051fe9-6a9c-46c2-b949-38ef78858dd0 | My funny cat | My cat likes to play the piano! So funny.  | cats,piano,lol | 2012-06-01 08:00:00+ | ctodd


CQL3 allows me to search the second component of a primary key, which
really just seems to be component 1 of a composite column.

So what thrift operation does this correspond to? This looks like a
column slice without specifying a key? How does this work internally?


Re: How does Cassandra optimize this query?

2012-11-04 Thread Sylvain Lebresne
On Sun, Nov 4, 2012 at 7:49 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 CQL3 Allows me to search the second component of a primary key. Which
 really just seems to be component 1 of a composite column.

 So what thrift operation does this correspond to? This looks like a
 column slice without specifying a key? How does this work internally?


get_range_slice (with the right slice predicate to select the columns where
the first component == 'My funny cat')
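
A rough sketch of what those slice bounds look like on the wire (illustrative
only; the CompositeType encoding assumed here is, per component, a 2-byte
big-endian length, the component bytes, then an end-of-component byte):

import struct

def composite_prefix_bounds(first_component):
    # Build slice start/finish so that only columns whose first composite
    # component equals first_component fall inside the slice.
    data = first_component.encode('utf-8')
    packed = struct.pack('>H', len(data)) + data
    start = packed + b'\x00'   # at the start of this component prefix
    finish = packed + b'\x01'  # just past the end of this component prefix
    return start, finish

# These bounds would form the SliceRange of a get_range_slice issued over an
# unbounded key range, i.e. every row in 'videos' gets scanned.
start, finish = composite_prefix_bounds('My funny cat')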

--
Sylvain


<    4   5   6   7   8   9   10   11   12   >