Re: Large number of row keys in query kills cluster

2014-06-11 Thread Peter Sanford
On Wed, Jun 11, 2014 at 10:12 AM, Jeremy Jongsma 
wrote:

> The big problem seems to have been requesting a large number of row keys
> combined with a large number of named columns in a query. 20K rows with 20K
> columns destroyed my cluster. Splitting it into slices of 100 sequential
> queries fixed the performance issue.
>
> When updating 20K rows at a time, I saw a different issue -
> BrokenPipeException from all nodes. Splitting into slices of 1000 fixed
> that issue.
>
> Is there any documentation on this? Obviously these limits will vary by
> cluster capacity, but for new users it would be great to know that you can
> run into problems with large queries, and how they present themselves when
> you hit them. The errors I saw are pretty opaque, and took me a couple days
> to track down.
>
>
The first thing that comes to mind is the Multiget section on the Datastax
anti-patterns page:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=concept_ds_emm_hwl_fk__multiple-gets



-psanford


Re: RPC timeout paging secondary index query results

2014-06-11 Thread Robert Coli
On Wed, Jun 11, 2014 at 12:43 PM, Phil Luckhurst <
phil.luckhu...@powerassure.com> wrote:

> It just seems that what we are trying to do here is
> such basic functionality of an index that I thought we must be doing
> something wrong for it to appear to be this broken.
>

To be clear, I did not read or assess your issue; it may in fact be a bug
in Secondary Indexes. If you can repro reliably, I would file a JIRA to
determine whether it is or not.

=Rob


Re: RPC timeout paging secondary index query results

2014-06-11 Thread Phil Luckhurst
Thanks Rob.

I understand that we will probably end up either creating our own index or
duplicating the data and we have done that to remove a reliance on secondary
indexes in other places. It just seems that what we are trying to do here is
such basic functionality of an index that I thought we must be doing
something wrong for it to appear to be this broken.

Phil





Re: RPC timeout paging secondary index query results

2014-06-11 Thread DuyHai Doan
I like the "- Provides the illusion that you are using a RDBMS." part ;-)


On Wed, Jun 11, 2014 at 8:52 PM, Robert Coli  wrote:

> On Wed, Jun 11, 2014 at 2:24 AM, Phil Luckhurst <
> phil.luckhu...@powerassure.com> wrote:
>
>> Is paging through the results of a secondary index query broken in
>> Cassandra
>> 2.0.7 or are we doing something wrong?
>>
>
> General feedback on questions of this type :
>
>
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201405.mbox/%3CCAEDUwd1i2BwJ-PAFE1qhjQFZ=qz2va_vxwo_jdycms8evkb...@mail.gmail.com%3E
>
> =Rob
>
>


Re: RPC timeout paging secondary index query results

2014-06-11 Thread Robert Coli
On Wed, Jun 11, 2014 at 2:24 AM, Phil Luckhurst <
phil.luckhu...@powerassure.com> wrote:

> Is paging through the results of a secondary index query broken in
> Cassandra
> 2.0.7 or are we doing something wrong?
>

General feedback on questions of this type :

http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201405.mbox/%3CCAEDUwd1i2BwJ-PAFE1qhjQFZ=qz2va_vxwo_jdycms8evkb...@mail.gmail.com%3E

=Rob


Re: Large number of row keys in query kills cluster

2014-06-11 Thread Robert Coli
On Wed, Jun 11, 2014 at 10:12 AM, Jeremy Jongsma 
wrote:

> Is there any documentation on this? Obviously these limits will vary by
> cluster capacity, but for new users it would be great to know that you can
> run into problems with large queries, and how they present themselves when
> you hit them. The errors I saw are pretty opaque, and took me a couple days
> to track down.
>

All operations in Cassandra are subject to timeouts denominated in seconds,
defaulting to 10 seconds or less. This strongly suggests that operations
which, for example, operate on 20,000 * 20,000 objects (400Mn) have a
meaningful risk of failure, as they are difficult to accomplish within 10
seconds or less. Lunch is still not free.

In fairness, CQL adds another non-helpful layer of opacity here; but what
you get for it is accessibility and ease of first use.


> In any case this seems like a bug to me - it shouldn't be possible to
> completely lock up a cluster with a valid query that isn't doing a table
> scan, should it?
>

There are lots of valid SQL queries which will "lock up" your server, for
some values of "lock up".

=Rob


Re: Large number of row keys in query kills cluster

2014-06-11 Thread Jeremy Jongsma
The big problem seems to have been requesting a large number of row keys
combined with a large number of named columns in a query. 20K rows with 20K
columns destroyed my cluster. Splitting it into slices of 100 sequential
queries fixed the performance issue.

When updating 20K rows at a time, I saw a different issue -
BrokenPipeException from all nodes. Splitting into slices of 1000 fixed
that issue.
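
For reference, the slicing itself is trivial -- something like this sketch
(using Guava's Lists.partition; the column family setup is as in the Astyanax
snippet quoted below):

import java.util.List;

import com.google.common.collect.Lists;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;

public class SlicedKeyReads {
    // Sketch only: read a large key list as sequential slices of 100 keys.
    public static void readInSlices(Keyspace keyspace,
                                    ColumnFamily<String, String> cf,
                                    List<String> allKeys) throws Exception {
        for (List<String> slice : Lists.partition(allKeys, 100)) {
            keyspace.prepareQuery(cf)
                    .getKeySlice(slice.toArray(new String[slice.size()]))
                    .execute();
        }
    }
}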

Is there any documentation on this? Obviously these limits will vary by
cluster capacity, but for new users it would be great to know that you can
run into problems with large queries, and how they present themselves when
you hit them. The errors I saw are pretty opaque, and took me a couple days
to track down.

In any case this seems like a bug to me - it shouldn't be possible to
completely lock up a cluster with a valid query that isn't doing a table
scan, should it?


On Wed, Jun 11, 2014 at 9:33 AM, Jeremy Jongsma  wrote:

> I'm using Astyanax with a query like this:
>
> clusterContext
>   .getClient()
>   .getKeyspace("instruments")
>   .prepareQuery(INSTRUMENTS_CF)
>   .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
>   .getKeySlice(new String[] {
> "ROW1",
> "ROW2",
> // 20,000 keys here...
> "ROW2"
>   })
>   .execute();
>
> At the time this query executes the first time (resulting in unresponsive
> cluster), there are zero rows in the column family. Schema is below, pretty
> basic:
>
> CREATE KEYSPACE instruments WITH replication = {
>   'class': 'NetworkTopologyStrategy',
>   'aws-us-east-1': '2'
> };
>
> CREATE TABLE instruments (
>   key bigint PRIMARY KEY,
>   definition blob,
>   id bigint,
>   name text,
>   symbol text,
>   updated bigint
> ) WITH COMPACT STORAGE AND
>   bloom_filter_fp_chance=0.01 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.00 AND
>   gc_grace_seconds=864000 AND
>   read_repair_chance=0.10 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'SnappyCompressor'};
>
>
>
>
> On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael  > wrote:
>
>> Perhaps if you described both the schema and the query in more detail, we
>> could help... e.g. did the query have an IN clause with 20,000 keys? Or is
>> the key compound? More detail will help.
>>
>>
>> On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma 
>> wrote:
>>
>>> I didn't explain clearly - I'm not requesting 20,000 unknown keys
>>> (resulting in a full scan), I'm requesting 20,000 specific rows by key.
>>> On Jun 10, 2014 6:02 PM, "DuyHai Doan"  wrote:
>>>
>>>> Hello Jeremy
>>>>
>>>> Basically what you are doing is to ask Cassandra to do a distributed
>>>> full scan on all the partitions across the cluster, it's normal that the
>>>> nodes are somehow stressed.
>>>>
>>>> How did you make the query? Are you using Thrift or CQL3 API?
>>>>
>>>> Please note that there is another way to get all partition keys :
>>>> SELECT DISTINCT <partition key> FROM..., more details here :
>>>> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>>>> I ran an application today that attempted to fetch 20,000+ unique row
>>>> keys in one query against a set of completely empty column families. On a
>>>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>>>> settings (2 GB heap), every single node immediately ran out of memory and
>>>> became unresponsive, to the point where I had to kill -9 the cassandra
>>>> processes.
>>>>
>>>> Now clearly this query is not the best idea in the world, but the
>>>> effects of it are a bit disturbing. What could be going on here? Are there
>>>> any other query pitfalls I should be aware of that have the potential to
>>>> explode the entire cluster?
>>>>
>>>> -j
>>>>
>>>
>>
>


Re: Large number of row keys in query kills cluster

2014-06-11 Thread Jeremy Jongsma
I'm using Astyanax with a query like this:

clusterContext
  .getClient()
  .getKeyspace("instruments")
  .prepareQuery(INSTRUMENTS_CF)
  .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
  .getKeySlice(new String[] {
"ROW1",
"ROW2",
// 20,000 keys here...
"ROW20000"
  })
  .execute();

At the time this query executes the first time (resulting in unresponsive
cluster), there are zero rows in the column family. Schema is below, pretty
basic:

CREATE KEYSPACE instruments WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'aws-us-east-1': '2'
};

CREATE TABLE instruments (
  key bigint PRIMARY KEY,
  definition blob,
  id bigint,
  name text,
  symbol text,
  updated bigint
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};




On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael 
wrote:

> Perhaps if you described both the schema and the query in more detail, we
> could help... e.g. did the query have an IN clause with 20,000 keys? Or is
> the key compound? More detail will help.
>
>
> On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma 
> wrote:
>
>> I didn't explain clearly - I'm not requesting 20,000 unknown keys
>> (resulting in a full scan), I'm requesting 20,000 specific rows by key.
>> On Jun 10, 2014 6:02 PM, "DuyHai Doan"  wrote:
>>
>>> Hello Jeremy
>>>
>>> Basically what you are doing is to ask Cassandra to do a distributed
>>> full scan on all the partitions across the cluster, it's normal that the
>>> nodes are somehow stressed.
>>>
>>> How did you make the query? Are you using Thrift or CQL3 API?
>>>
>>> Please note that there is another way to get all partition keys : SELECT
>>> DISTINCT <partition key> FROM..., more details here :
>>> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>>> I ran an application today that attempted to fetch 20,000+ unique row
>>> keys in one query against a set of completely empty column families. On a
>>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>>> settings (2 GB heap), every single node immediately ran out of memory and
>>> became unresponsive, to the point where I had to kill -9 the cassandra
>>> processes.
>>>
>>> Now clearly this query is not the best idea in the world, but the
>>> effects of it are a bit disturbing. What could be going on here? Are there
>>> any other query pitfalls I should be aware of that have the potential to
>>> explode the entire cluster?
>>>
>>> -j
>>>
>>
>


RPC timeout paging secondary index query results

2014-06-11 Thread Phil Luckhurst
Is paging through the results of a secondary index query broken in Cassandra
2.0.7 or are we doing something wrong?

We have table with a few hundred thousand records and an indexed
low-cardinality column. The relevant bits of the table definition are shown
below

CREATE TABLE measurement (
measurement_id uuid,
controller_id uuid,
...
PRIMARY KEY (measurement_id)
);

CREATE INDEX ON measurement(controller_id);

We originally got the timeout when trying to page through the results of a
'SELECT * FROM measurement WHERE controller_id = xxx-xxx-xxx' query using
the java driver 2.0.2 but we can also consistently reproduce the problem
with CQLSH.
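
For reference, on the driver side we're not doing anything exotic -- essentially
the stock automatic paging, something like this sketch (contact point and
keyspace name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagingSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("myks"); // keyspace name is a placeholder
        Statement stmt = new SimpleStatement(
                "SELECT measurement_id FROM measurement "
              + "WHERE controller_id = 0167bfa6-0918-47ba-8b65-dcccecbcd79f");
        stmt.setFetchSize(1000); // page size, not a row limit; the driver fetches pages as we iterate
        for (Row row : session.execute(stmt)) {
            row.getUUID("measurement_id");
        }
        cluster.close();
    }
}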

In CQLSH we can start paging through the measurement_id entries a 1000 at a
time for a specific controller_id by using the token() method, e.g.

SELECT measurement_id, token(measurement_id) FROM measurement WHERE
controller_id = 0167bfa6-0918-47ba-8b65-dcccecbcd79f AND
token(measurement_id) >= -8975013189301561463 LIMIT 1000;

This works for 8 queries but consistently fails with an RPC timeout for rows
8000-9000. If from row 8000 we start using a smaller LIMIT size we can get
to approx row 8950 but at that point we get the timeout even if we set
'LIMIT 10'. Looking at the trace output it seems to be doing
thousands of queries on the index table for every request even if we set
'LIMIT 1' - almost as if it's starting from the beginning of the index for
each page request?

It all seems very similar to  CASSANDRA-5975
<https://issues.apache.org/jira/browse/CASSANDRA-5975>   but that is marked
as resolved in Cassandra 2.0.1. For example this query for a single record

SELECT measurement_id, token(measurement_id) FROM measurement WHERE
controller_id = 0167bfa6-0918-47ba-8b65-dcccecbcd79f AND
token(measurement_id) = -8947401969768490998;

works fine and produces approx 60 lines of trace output. If we simply add
'LIMIT 1' to the statement the trace output is approx 70,000 lines!

It looks like we may have to give up on using secondary indexes but it would
be nice to know if what we are trying to do is correct and should work.

Thanks
Phil











Re: Large number of row keys in query kills cluster

2014-06-10 Thread Laing, Michael
Perhaps if you described both the schema and the query in more detail, we
could help... e.g. did the query have an IN clause with 20,000 keys? Or is
the key compound? More detail will help.


On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma  wrote:

> I didn't explain clearly - I'm not requesting 20,000 unknown keys
> (resulting in a full scan), I'm requesting 20,000 specific rows by key.
> On Jun 10, 2014 6:02 PM, "DuyHai Doan"  wrote:
>
>> Hello Jeremy
>>
>> Basically what you are doing is to ask Cassandra to do a distributed full
>> scan on all the partitions across the cluster, it's normal that the nodes
>> are somehow stressed.
>>
>> How did you make the query? Are you using Thrift or CQL3 API?
>>
>> Please note that there is another way to get all partition keys : SELECT
>> DISTINCT <partition key> FROM..., more details here :
>> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>> I ran an application today that attempted to fetch 20,000+ unique row
>> keys in one query against a set of completely empty column families. On a
>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>> settings (2 GB heap), every single node immediately ran out of memory and
>> became unresponsive, to the point where I had to kill -9 the cassandra
>> processes.
>>
>> Now clearly this query is not the best idea in the world, but the effects
>> of it are a bit disturbing. What could be going on here? Are there any
>> other query pitfalls I should be aware of that have the potential to
>> explode the entire cluster?
>>
>> -j
>>
>


Re: Large number of row keys in query kills cluster

2014-06-10 Thread Jeremy Jongsma
I didn't explain clearly - I'm not requesting 20,000 unknown keys (resulting
in a full scan), I'm requesting 20,000 specific rows by key.
On Jun 10, 2014 6:02 PM, "DuyHai Doan"  wrote:

> Hello Jeremy
>
> Basically what you are doing is to ask Cassandra to do a distributed full
> scan on all the partitions across the cluster, it's normal that the nodes
> are somehow stressed.
>
> How did you make the query? Are you using Thrift or CQL3 API?
>
> Please note that there is another way to get all partition keys : SELECT
> DISTINCT <partition key> FROM..., more details here :
> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
> I ran an application today that attempted to fetch 20,000+ unique row keys
> in one query against a set of completely empty column families. On a 4-node
> cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
> heap), every single node immediately ran out of memory and became
> unresponsive, to the point where I had to kill -9 the cassandra processes.
>
> Now clearly this query is not the best idea in the world, but the effects
> of it are a bit disturbing. What could be going on here? Are there any
> other query pitfalls I should be aware of that have the potential to
> explode the entire cluster?
>
> -j
>


Re: Large number of row keys in query kills cluster

2014-06-10 Thread DuyHai Doan
Hello Jeremy

Basically what you are doing is to ask Cassandra to do a distributed full
scan on all the partitions across the cluster, it's normal that the nodes
are somehow stressed.

How did you make the query? Are you using Thrift or CQL3 API?

Please note that there is another way to get all partition keys : SELECT
DISTINCT <partition key> FROM..., more details here :
www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
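
For example, something like this sketch with the Java driver (substitute your
real keyspace, table and partition key column; here I'm assuming a table
"instruments" whose partition key column is "key"):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DistinctPartitionKeys {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("instruments");
        // DISTINCT is only allowed on the partition key column(s)
        for (Row row : session.execute("SELECT DISTINCT key FROM instruments")) {
            System.out.println(row.getLong("key"));
        }
        cluster.close();
    }
}
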
I ran an application today that attempted to fetch 20,000+ unique row keys
in one query against a set of completely empty column families. On a 4-node
cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
heap), every single node immediately ran out of memory and became
unresponsive, to the point where I had to kill -9 the cassandra processes.

Now clearly this query is not the best idea in the world, but the effects
of it are a bit disturbing. What could be going on here? Are there any
other query pitfalls I should be aware of that have the potential to
explode the entire cluster?

-j


Large number of row keys in query kills cluster

2014-06-10 Thread Jeremy Jongsma
I ran an application today that attempted to fetch 20,000+ unique row keys
in one query against a set of completely empty column families. On a 4-node
cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
heap), every single node immediately ran out of memory and became
unresponsive, to the point where I had to kill -9 the cassandra processes.

Now clearly this query is not the best idea in the world, but the effects
of it are a bit disturbing. What could be going on here? Are there any
other query pitfalls I should be aware of that have the potential to
explode the entire cluster?

-j


Re: Cannot query secondary index

2014-06-10 Thread Paulo Ricardo Motta Gomes
Our approach for this scenario is to run a hadoop job that periodically
cleans old entries, but I admit it's far from ideal. It would be nice to have
a more native way to perform these kinds of tasks.

There's a legend about a compaction strategy that keeps only the first N
entries of a partition key. I don't think it has been implemented yet, but
if I remember correctly there's a JIRA ticket about it.


On Tue, Jun 10, 2014 at 3:39 PM, Redmumba  wrote:

> Honestly, this has been by far my single biggest obstacle with Cassandra
> for time-based data--cleaning up the old data when the deletion criteria
> (i.e., date) isn't the primary key.  I've asked about a few different
> approaches, but I haven't really seen any feasible options that can be
> implemented easily.  I've seen the following:
>
>1. Use date-based tables, then drop old tables, ala
>"audit_table_20140610", "audit_table_20140609", etc..
>But then I run into the issue of having to query every table--I would
>have to execute queries against every day to get the data, and then merge
>the data myself.  Unless, there's something in the binary driver I'm
>missing, it doesn't sound like this would be practical.
>2. Use a TTL
>But then I have to basically decide on a value that works for
>everything and, if it ever turns out I overestimated, I'm basically SOL,
>because my cluster will be out of space.
>3. Maintain a separate index of days to keys, and use this index as
>the reference for which keys to delete.
>But then this requires maintaining another index and a relatively
>manual delete.
>
> I can't help but feel that I am just way over-engineering this, or that
> I'm missing something basic in my data model.  Except for the last
> approach, I can't help but feel that I'm overlooking something obvious.
>
> Andrew
>
>
> Of course, Jonathan, I'll do my best!
>
> It's an auditing table that, right now, uses a primary key consisting of a
> combination of a combined partition id of the region and the object id, the
> date, and the process ID.  Each event in our system will create anywhere
> from 1-20 rows, for example, and multiple parts of the system might be
> working on the same "object ID".  So the CF is constantly being appended
> to, but reads are rare.
>
> CREATE TABLE audit (
>> id bigint,
>> region ascii,
>> date timestamp,
>> pid int,
>> PRIMARY KEY ((id, region), date, pid)
>> );
>
>
> Data is queried on a specific object ID and region.  Optionally, users can
> restrict their query to a specific date range, which the above data model
> provides.
>
> However, we generate quite a bit of data, and we want a convenient way to
> get rid of the oldest data.  Since our system scales with the time of year,
> we might get 50GB a day during peak, and 5GB of data off peak.  We could
> pick the safest number--let's say, 30 days--and set the TTL using that.
> The problem there is that we'll be using a very small percentage of our
> available space 90% of the year.
>
> What I'd like to be able to do is drop old tables as needed--i.e., let's
> say when we hit 80% load across the cluster (or some such metric that takes
> the cluster-wide load into account), I want to drop the oldest day's
> records until we're under 80%.  That way, we're always using the maximum
> amount of space we can, without having to worry about getting to the point
> where we run out of space cluster-wide.
>
> My thoughts are--we could always make the date part of the primary key,
> but then we'd either a) have to query the entire range of dates, or b) we'd
> have to force a small date range when querying.  What are the penalties?
> Do you have any other suggestions?
>
>
> On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <
> jlacefi...@datastax.com> wrote:
>
>> Hello,
>>
>>   Will you please describe the use case and what you are trying to model.
>>  What are some questions/queries that you would like to serve via
>> Cassandra.  This will help the community help you a little better.
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>> (404) 822 3487
>>  <http://www.linkedin.com/in/jlacefield>
>>
>> <http://www.datastax.com/cassandrasummit14>
>>
>>
>>
>> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba  wrote:
>>
>>> I've been trying to work around using "date-based tables" because I'd
>>> like to avoid the overhead.  It seems, however, that this is just not going to work.

Re: Cannot query secondary index

2014-06-10 Thread Redmumba
Honestly, this has been by far my single biggest obstacle with Cassandra
for time-based data--cleaning up the old data when the deletion criteria
(i.e., date) isn't the primary key.  I've asked about a few different
approaches, but I haven't really seen any feasible options that can be
implemented easily.  I've seen the following:

   1. Use date-based tables, then drop old tables, ala
   "audit_table_20140610", "audit_table_20140609", etc..
   But then I run into the issue of having to query every table--I would
   have to execute queries against every day to get the data, and then merge
   the data myself.  Unless, there's something in the binary driver I'm
   missing, it doesn't sound like this would be practical.
   2. Use a TTL
   But then I have to basically decide on a value that works for everything
   and, if it ever turns out I overestimated, I'm basically SOL, because my
   cluster will be out of space.
   3. Maintain a separate index of days to keys, and use this index as the
   reference for which keys to delete.
   But then this requires maintaining another index and a relatively manual
   delete.
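
For (3), what I have in mind is roughly the sketch below -- the
audit_keys_by_day table and all the literal values are made up, and real code
would use prepared statements and probably smaller buckets:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DayIndexSketch {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build()
                                 .connect("asinauditing");
        // Hypothetical manual index: one partition per day holding the full audit primary keys.
        session.execute("CREATE TABLE IF NOT EXISTS audit_keys_by_day ("
                + " day text, id bigint, region ascii, date timestamp, pid int,"
                + " PRIMARY KEY (day, id, region, date, pid))");
        // On write, also record the audit row's primary key under today's bucket.
        session.execute("INSERT INTO audit_keys_by_day (day, id, region, date, pid) "
                + "VALUES ('2014-06-10', 12345, 'us-east', '2014-06-10 12:00:00', 6789)");
        // To reclaim space, walk the oldest day's bucket and delete those exact audit rows.
        for (Row row : session.execute(
                "SELECT id, region, date, pid FROM audit_keys_by_day WHERE day = '2014-05-01'")) {
            session.execute(String.format(
                    "DELETE FROM audit WHERE id = %d AND region = '%s' AND date = %d AND pid = %d",
                    row.getLong("id"), row.getString("region"),
                    row.getDate("date").getTime(), row.getInt("pid")));
        }
        session.execute("DELETE FROM audit_keys_by_day WHERE day = '2014-05-01'");
    }
}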

I can't help but feel that I am just way over-engineering this, or that I'm
missing something basic in my data model.  Except for the last approach, I
can't help but feel that I'm overlooking something obvious.

Andrew


Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key consisting of a
combination of a combined partition id of the region and the object id, the
date, and the process ID.  Each event in our system will create anywhere
from 1-20 rows, for example, and multiple parts of the system might be
working on the same "object ID".  So the CF is constantly being appended
to, but reads are rare.

CREATE TABLE audit (
> id bigint,
> region ascii,
> date timestamp,
> pid int,
> PRIMARY KEY ((id, region), date, pid)
> );


Data is queried on a specific object ID and region.  Optionally, users can
restrict their query to a specific date range, which the above data model
provides.

However, we generate quite a bit of data, and we want a convenient way to
get rid of the oldest data.  Since our system scales with the time of year,
we might get 50GB a day during peak, and 5GB of data off peak.  We could
pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that we'll be using a very small percentage of our
available space 90% of the year.

What I'd like to be able to do is drop old tables as needed--i.e., let's
say when we hit 80% load across the cluster (or some such metric that takes
the cluster-wide load into account), I want to drop the oldest day's
records until we're under 80%.  That way, we're always using the maximum
amount of space we can, without having to worry about getting to the point
where we run out of space cluster-wide.

My thoughts are--we could always make the date part of the primary key, but
then we'd either a) have to query the entire range of dates, or b) we'd
have to force a small date range when querying.  What are the penalties?
Do you have any other suggestions?


On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield 
wrote:

> Hello,
>
>   Will you please describe the use case and what you are trying to model.
>  What are some questions/queries that you would like to serve via
> Cassandra.  This will help the community help you a little better.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba  wrote:
>
>> I've been trying to work around using "date-based tables" because I'd
>> like to avoid the overhead.  It seems, however, that this is just not going
>> to work.
>>
>> So here's a question--for these date-based tables (i.e., a table per
>> day/week/month/whatever), how are they queried?  If I keep 60 days worth of
>> auditing data, for example, I'd need to query all 60 tables--can I do that
>> smoothly?  Or do I have to have 60 different select statements?  Is there a
>> way for me to run the same query against all the tables?
>>
>>
>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba  wrote:
>>
>>> Ah, so the secondary indices are really secondary against the primary
>>> key.  That makes sense.
>>>
>>> I'm beginning to see why the whole "date-based table" approach is the
>>> only one I've been able to find... thanks for the quick responses, guys!
>>>
>>>
>>> On Mon, Jun 9, 2014 

Re: Cannot query secondary index

2014-06-09 Thread Redmumba
Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key consisting of a
combination of a combined partition id of the region and the object id, the
date, and the process ID.  Each event in our system will create anywhere
from 1-20 rows, for example, and multiple parts of the system might be
working on the same "object ID".  So the CF is constantly being appended
to, but reads are rare.

CREATE TABLE audit (
> id bigint,
> region ascii,
> date timestamp,
> pid int,
> PRIMARY KEY ((id, region), date, pid)
> );


Data is queried on a specific object ID and region.  Optionally, users can
restrict their query to a specific date range, which the above data model
provides.

However, we generate quite a bit of data, and we want a convenient way to
get rid of the oldest data.  Since our system scales with the time of year,
we might get 50GB a day during peak, and 5GB of data off peak.  We could
pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that we'll be using a very small percentage of our
available space 90% of the year.

What I'd like to be able to do is drop old tables as needed--i.e., let's
say when we hit 80% load across the cluster (or some such metric that takes
the cluster-wide load into account), I want to drop the oldest day's
records until we're under 80%.  That way, we're always using the maximum
amount of space we can, without having to worry about getting to the point
where we run out of space cluster-wide.

My thoughts are--we could always make the date part of the primary key, but
then we'd either a) have to query the entire range of dates, or b) we'd
have to force a small date range when querying.  What are the penalties?
Do you have any other suggestions?


On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield 
wrote:

> Hello,
>
>   Will you please describe the use case and what you are trying to model.
>  What are some questions/queries that you would like to serve via
> Cassandra.  This will help the community help you a little better.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba  wrote:
>
>> I've been trying to work around using "date-based tables" because I'd
>> like to avoid the overhead.  It seems, however, that this is just not going
>> to work.
>>
>> So here's a question--for these date-based tables (i.e., a table per
>> day/week/month/whatever), how are they queried?  If I keep 60 days worth of
>> auditing data, for example, I'd need to query all 60 tables--can I do that
>> smoothly?  Or do I have to have 60 different select statements?  Is there a
>> way for me to run the same query against all the tables?
>>
>>
>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba  wrote:
>>
>>> Ah, so the secondary indices are really secondary against the primary
>>> key.  That makes sense.
>>>
>>> I'm beginning to see why the whole "date-based table" approach is the
>>> only one I've been able to find... thanks for the quick responses, guys!
>>>
>>>
>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>>> michal.michal...@boxever.com> wrote:
>>>
>>>> Secondary indexes internally are just CFs that map the indexed value to
>>>> a row key which that value belongs to, so you can only query these indexes
>>>> using "=", not ">", ">=" etc.
>>>>
>>>> However, your query does not require index *IF* you provide a row key -
>>>> you can use "<" or ">" like you did for the date column, as long as you
>>>> refer to a single row. However, if you don't provide it, it's not going to
>>>> work.
>>>>
>>>> M.
>>>>
>>>> Kind regards,
>>>> Michał Michalski,
>>>> michal.michal...@boxever.com
>>>>
>>>>
>>>> On 9 June 2014 21:18, Redmumba  wrote:
>>>>
>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>> query based on it, it fails saying that I must use ALLOW FILTERING--which
>>>>> to me, means its not using the secondary index.  Table definition is
>>>>> (snipping out irrelevant parts)...
>>>>>
>>>>> CREATE TABLE audit (
>>>>>> id bigint,

Re: Cannot query secondary index

2014-06-09 Thread Jonathan Lacefield
Hello,

  Will you please describe the use case and what you are trying to model.
 What are some questions/queries that you would like to serve via
Cassandra.  This will help the community help you a little better.

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>



On Mon, Jun 9, 2014 at 7:51 PM, Redmumba  wrote:

> I've been trying to work around using "date-based tables" because I'd like
> to avoid the overhead.  It seems, however, that this is just not going to
> work.
>
> So here's a question--for these date-based tables (i.e., a table per
> day/week/month/whatever), how are they queried?  If I keep 60 days worth of
> auditing data, for example, I'd need to query all 60 tables--can I do that
> smoothly?  Or do I have to have 60 different select statements?  Is there a
> way for me to run the same query against all the tables?
>
>
> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba  wrote:
>
>> Ah, so the secondary indices are really secondary against the primary
>> key.  That makes sense.
>>
>> I'm beginning to see why the whole "date-based table" approach is the
>> only one I've been able to find... thanks for the quick responses, guys!
>>
>>
>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>> michal.michal...@boxever.com> wrote:
>>
>>> Secondary indexes internally are just CFs that map the indexed value to
>>> a row key which that value belongs to, so you can only query these indexes
>>> using "=", not ">", ">=" etc.
>>>
>>> However, your query does not require index *IF* you provide a row key -
>>> you can use "<" or ">" like you did for the date column, as long as you
>>> refer to a single row. However, if you don't provide it, it's not going to
>>> work.
>>>
>>> M.
>>>
>>> Kind regards,
>>> Michał Michalski,
>>> michal.michal...@boxever.com
>>>
>>>
>>> On 9 June 2014 21:18, Redmumba  wrote:
>>>
>>>> I have a table with a timestamp column on it; however, when I try to
>>>> query based on it, it fails saying that I must use ALLOW FILTERING--which
>>>> to me, means its not using the secondary index.  Table definition is
>>>> (snipping out irrelevant parts)...
>>>>
>>>> CREATE TABLE audit (
>>>>> id bigint,
>>>>> date timestamp,
>>>>> ...
>>>>> PRIMARY KEY (id, date)
>>>>> );
>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>
>>>>
>>>> There are other fields, but they are not relevant to this example.  The
>>>> date is part of the primary key, and I have a secondary index on it.  When
>>>> I run a SELECT against it, I get an error:
>>>>
>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>> filtering and thus may have unpredictable performance. If you want to
>>>>> execute this query despite the performance unpredictability, use ALLOW
>>>>> FILTERING
>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>>>>> ALLOW FILTERING;
>>>>> Request did not complete within rpc_timeout.
>>>>>
>>>>
>>>> How can I force it to use the index?  I've seen rebuild_index tasks
>>>> running, but can I verify the "health" of the index?
>>>>
>>>
>>>
>>
>


Re: Cannot query secondary index

2014-06-09 Thread Redmumba
I've been trying to work around using "date-based tables" because I'd like
to avoid the overhead.  It seems, however, that this is just not going to
work.

So here's a question--for these date-based tables (i.e., a table per
day/week/month/whatever), how are they queried?  If I keep 60 days worth of
auditing data, for example, I'd need to query all 60 tables--can I do that
smoothly?  Or do I have to have 60 different select statements?  Is there a
way for me to run the same query against all the tables?
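
The only thing I can picture is fanning the same statement out to every table
and merging client-side, something like this sketch (Java driver; the per-day
table names and literal values are hypothetical):

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class FanOutQuery {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build()
                                 .connect("asinauditing");
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        // Hypothetical per-day tables, e.g. audit_20140608 .. audit_20140610.
        for (String day : new String[] { "20140608", "20140609", "20140610" }) {
            futures.add(session.executeAsync(
                    "SELECT * FROM audit_" + day
                  + " WHERE id = 12345 AND region = 'us-east'"));
        }
        // Merge the per-table results client-side.
        List<Row> merged = new ArrayList<Row>();
        for (ResultSetFuture f : futures) {
            merged.addAll(f.getUninterruptibly().all());
        }
    }
}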


On Mon, Jun 9, 2014 at 3:42 PM, Redmumba  wrote:

> Ah, so the secondary indices are really secondary against the primary
> key.  That makes sense.
>
> I'm beginning to see why the whole "date-based table" approach is the only
> one I've been able to find... thanks for the quick responses, guys!
>
>
> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
>> Secondary indexes internally are just CFs that map the indexed value to a
>> row key which that value belongs to, so you can only query these indexes
>> using "=", not ">", ">=" etc.
>>
>> However, your query does not require index *IF* you provide a row key -
>> you can use "<" or ">" like you did for the date column, as long as you
>> refer to a single row. However, if you don't provide it, it's not going to
>> work.
>>
>> M.
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michal...@boxever.com
>>
>>
>> On 9 June 2014 21:18, Redmumba  wrote:
>>
>>> I have a table with a timestamp column on it; however, when I try to
>>> query based on it, it fails saying that I must use ALLOW FILTERING--which
>>> to me, means its not using the secondary index.  Table definition is
>>> (snipping out irrelevant parts)...
>>>
>>> CREATE TABLE audit (
>>>> id bigint,
>>>> date timestamp,
>>>> ...
>>>> PRIMARY KEY (id, date)
>>>> );
>>>> CREATE INDEX date_idx ON audit (date);
>>>>
>>>
>>> There are other fields, but they are not relevant to this example.  The
>>> date is part of the primary key, and I have a secondary index on it.  When
>>> I run a SELECT against it, I get an error:
>>>
>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>> Bad Request: Cannot execute this query as it might involve data
>>>> filtering and thus may have unpredictable performance. If you want to
>>>> execute this query despite the performance unpredictability, use ALLOW
>>>> FILTERING
>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>>>> ALLOW FILTERING;
>>>> Request did not complete within rpc_timeout.
>>>>
>>>
>>> How can I force it to use the index?  I've seen rebuild_index tasks
>>> running, but can I verify the "health" of the index?
>>>
>>
>>
>


Re: Cannot query secondary index

2014-06-09 Thread Redmumba
Ah, so the secondary indices are really secondary against the primary key.
That makes sense.

I'm beginning to see why the whole "date-based table" approach is the only
one I've been able to find... thanks for the quick responses, guys!


On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
michal.michal...@boxever.com> wrote:

> Secondary indexes internally are just CFs that map the indexed value to a
> row key which that value belongs to, so you can only query these indexes
> using "=", not ">", ">=" etc.
>
> However, your query does not require index *IF* you provide a row key -
> you can use "<" or ">" like you did for the date column, as long as you
> refer to a single row. However, if you don't provide it, it's not going to
> work.
>
> M.
>
> Kind regards,
> Michał Michalski,
> michal.michal...@boxever.com
>
>
> On 9 June 2014 21:18, Redmumba  wrote:
>
>> I have a table with a timestamp column on it; however, when I try to
>> query based on it, it fails saying that I must use ALLOW FILTERING--which
>> to me, means its not using the secondary index.  Table definition is
>> (snipping out irrelevant parts)...
>>
>> CREATE TABLE audit (
>>> id bigint,
>>> date timestamp,
>>> ...
>>> PRIMARY KEY (id, date)
>>> );
>>> CREATE INDEX date_idx ON audit (date);
>>>
>>
>> There are other fields, but they are not relevant to this example.  The
>> date is part of the primary key, and I have a secondary index on it.  When
>> I run a SELECT against it, I get an error:
>>
>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>> Bad Request: Cannot execute this query as it might involve data
>>> filtering and thus may have unpredictable performance. If you want to
>>> execute this query despite the performance unpredictability, use ALLOW
>>> FILTERING
>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>>> ALLOW FILTERING;
>>> Request did not complete within rpc_timeout.
>>>
>>
>> How can I force it to use the index?  I've seen rebuild_index tasks
>> running, but can I verify the "health" of the index?
>>
>
>


Re: Cannot query secondary index

2014-06-09 Thread Michal Michalski
Secondary indexes internally are just CFs that map the indexed value to a
row key which that value belongs to, so you can only query these indexes
using "=", not ">", ">=" etc.

However, your query does not require index *IF* you provide a row key - you
can use "<" or ">" like you did for the date column, as long as you refer
to a single row. However, if you don't provide it, it's not going to work.
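
E.g. with your audit table something like this is fine without the index or
ALLOW FILTERING, because the partition key is pinned and the range is on the
clustering column (sketch; the id value is made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RangeWithinPartition {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build().connect();
        // Fine: partition key provided, "<" applied to the clustering column "date".
        session.execute("SELECT * FROM asinauditing.asinaudit"
                + " WHERE id = 12345 AND date < '2014-05-01'");
        // Without the partition key, the same range needs ALLOW FILTERING (and times out).
    }
}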

M.

Kind regards,
Michał Michalski,
michal.michal...@boxever.com


On 9 June 2014 21:18, Redmumba  wrote:

> I have a table with a timestamp column on it; however, when I try to query
> based on it, it fails saying that I must use ALLOW FILTERING--which to me,
> means its not using the secondary index.  Table definition is (snipping out
> irrelevant parts)...
>
> CREATE TABLE audit (
>> id bigint,
>> date timestamp,
>> ...
>> PRIMARY KEY (id, date)
>> );
>> CREATE INDEX date_idx ON audit (date);
>>
>
> There are other fields, but they are not relevant to this example.  The
> date is part of the primary key, and I have a secondary index on it.  When
> I run a SELECT against it, I get an error:
>
> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>> Bad Request: Cannot execute this query as it might involve data filtering
>> and thus may have unpredictable performance. If you want to execute this
>> query despite the performance unpredictability, use ALLOW FILTERING
>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>> ALLOW FILTERING;
>> Request did not complete within rpc_timeout.
>>
>
> How can I force it to use the index?  I've seen rebuild_index tasks
> running, but can I verify the "health" of the index?
>


Re: Cannot query secondary index

2014-06-09 Thread Jonathan Lacefield
Hello,

  You are receiving this error because you are not passing in the Partition
Key as part of your query.  Cassandra is telling you it doesn't know which
node to find the data on, and you haven't explicitly told it to search across
all your nodes for the data.  The ALLOW FILTERING clause bypasses the need
to pass in a partition key in your query.
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

  Big picture, for data modeling in Cassandra, it's advisable to model your
data based on the query access patterns and to duplicate data into tables
that represent your query.  In this case, creating a table with a Partition
Key of date, could benefit you.  Heavy use of ALLOW FILTERING could cause
performance issues within your cluster.
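
For example, something along these lines -- purely illustrative; you would
choose the columns and the day bucketing to match your real queries:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class AuditByDaySketch {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build()
                                 .connect("asinauditing");
        // One partition per day: "everything for a given day" becomes a single-partition read.
        session.execute("CREATE TABLE IF NOT EXISTS audit_by_day ("
                + " day text, date timestamp, id bigint, pid int,"
                + " PRIMARY KEY (day, date, id, pid))");
        session.execute("SELECT * FROM audit_by_day WHERE day = '2014-05-01' LIMIT 100");
    }
}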

  Also, please be aware that Secondary Indexes are much different in
Cassandra-land compared to indexes in RDBMS-land.  They should be used only
when necessary, i.e. an explicit use case.  Typically, modeling your data
so you can avoid Secondary Indexes will ensure a well performing system and
queries.

  Here's a good intro to Cassandra data modeling:
https://www.youtube.com/watch?v=HdJlsOZVGwM

  Hope this helps.

Jonathan

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>



On Mon, Jun 9, 2014 at 5:18 PM, Redmumba  wrote:

> I have a table with a timestamp column on it; however, when I try to query
> based on it, it fails saying that I must use ALLOW FILTERING--which to me,
> means its not using the secondary index.  Table definition is (snipping out
> irrelevant parts)...
>
> CREATE TABLE audit (
>> id bigint,
>> date timestamp,
>> ...
>> PRIMARY KEY (id, date)
>> );
>> CREATE INDEX date_idx ON audit (date);
>>
>
> There are other fields, but they are not relevant to this example.  The
> date is part of the primary key, and I have a secondary index on it.  When
> I run a SELECT against it, I get an error:
>
> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>> Bad Request: Cannot execute this query as it might involve data filtering
>> and thus may have unpredictable performance. If you want to execute this
>> query despite the performance unpredictability, use ALLOW FILTERING
>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>> ALLOW FILTERING;
>> Request did not complete within rpc_timeout.
>>
>
> How can I force it to use the index?  I've seen rebuild_index tasks
> running, but can I verify the "health" of the index?
>


Cannot query secondary index

2014-06-09 Thread Redmumba
I have a table with a timestamp column on it; however, when I try to query
based on it, it fails saying that I must use ALLOW FILTERING--which to me,
means its not using the secondary index.  Table definition is (snipping out
irrelevant parts)...

CREATE TABLE audit (
> id bigint,
> date timestamp,
> ...
> PRIMARY KEY (id, date)
> );
> CREATE INDEX date_idx ON audit (date);
>

There are other fields, but they are not relevant to this example.  The
date is part of the primary key, and I have a secondary index on it.  When
I run a SELECT against it, I get an error:

cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
> Bad Request: Cannot execute this query as it might involve data filtering
> and thus may have unpredictable performance. If you want to execute this
> query despite the performance unpredictability, use ALLOW FILTERING
> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
> ALLOW FILTERING;
> Request did not complete within rpc_timeout.
>

How can I force it to use the index?  I've seen rebuild_index tasks
running, but can I verify the "health" of the index?


Re: backend query of a Cassandra db

2014-05-30 Thread Bobby Chowdary

There are a few ways you can do this; it really depends on whether you prefer
to have a separate cluster or use the same nodes, etc...

1. If you have DSE, it has Hadoop/Hive integrated, or you can use the open-source
Hive handler by Tuplejump: https://github.com/tuplejump/cash
2. Spark/Shark: using Tuplejump Calliope and Cash
(http://tuplejump.github.io/calliope/ , https://github.com/tuplejump/cash); you
can refer to Brian O'Neill's blog posts here:
http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html
and http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
3. PrestoDB http://prestodb.io/ 

Thanks
Bobby


On May 30, 2014, at 12:09 PM, "cbert...@libero.it"  wrote:

Hello,
I have a working cluster of Cassandra that performs very well on a high 
traffic web application. 
Now I need to build a backend web application to query Cassandra on many non 
indexed columns ... what is the best way to do that? Apache hive? Pig?


Cassandra 2 


Thanks


backend query of a Cassandra db

2014-05-30 Thread cbert...@libero.it
Hello,
I have a working cluster of Cassandra that performs very well on a high 
traffic web application. 
Now I need to build a backend web application to query Cassandra on many non 
indexed columns ... what is the best way to do that? Apache hive? Pig?

Cassandra 2 

Thanks


Re: Possible to Add multiple columns in one query ?

2014-05-25 Thread Colin
Also, make sure you're using prepared statements.

--
Colin
320-221-9531


> On May 25, 2014, at 1:56 PM, "Jack Krupansky"  wrote:
> 
> Typo: I presume “channelid” should be “tagid” for the partition key for your 
> table.
>  
> Yes, BATCH statements are the way to go, but be careful not to make your 
> batches too large, otherwise you could lose performance when Cassandra is 
> relatively idle while the batch is slowly streaming in to the coordinator 
> node over the network. Better to break up a large batch into multiple 
> moderate size batches (exact size and number will vary and need testing to 
> deduce) that will transmit quicker and can be executed in parallel.
>  
> I’m not sure Cassandra on a laptop would be the best measure of performance 
> for a real cluster, especially compared to a server with more CPU cores than 
> your laptop.
>  
> And for a real cluster, rows with different partition keys can be sent to a 
> coordinator node that owns that partition key, which could be multiple nodes 
> for RF>1.
>  
> -- Jack Krupansky
>  
> From: Mark Farnan
> Sent: Sunday, May 25, 2014 9:36 AM
> To: user@cassandra.apache.org
> Subject: Possible to Add multiple columns in one query ?
>  
> I’m sure this is a  CQL 101 question, but.  
>  
> Is it possible to add MULTIPLE   Rows/Columns  to a single Partition in a 
> single CQL 3  Query / Call. 
>  
> Need:
> I’m trying to find the most efficient way to add multiple time series events 
> to a table in a single call.
> Whilst most time series data comes in sequentially, we have a case where it 
> is often loaded in bulk,  say sent  100,000 points for 50  channels/tags  at 
> one go.  (sometimes more), and this needs to be loaded as quickly and 
> efficiently as possible.
>  
> Fairly standard Time-Series schema (this is for testing purposes only at this 
> point, and doesn’t represent final schemas)
>  
> CREATE TABLE tag (
>   tagid int,
>   idx timestamp,
>   value double,
>   PRIMARY KEY (channelid, idx)
> ) WITH CLUSTERING ORDER BY (idx DESC);
>  
>  
> Currently I’m using Batch statements, but even that is not fast enough.
>  
> Note: At this point I’m testing on a single node cluster on laptop, to 
> compare different versions.
>  
> We are using DataStax C# 2.0 (beta) client. And Cassandra 2.0.7
>  
> Regards
> Mark.


Re: Possible to Add multiple columns in one query ?

2014-05-25 Thread Colin
Try asynch updates, and collect the futures at 1,000 and play around from 
there.  

Also, in the real world, you'd want to use load balancing and token aware 
policies when connecting to the cluster.  This will actually bypass the 
coordinator and write directly to the correct nodes.

I will post a link to my github with an example when I get off the road
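
In the meantime, the rough shape is below -- a sketch with the Java driver (the
C# driver has equivalent policies), and it assumes the channelid/tagid typo
Jack mentioned is fixed and a placeholder keyspace name:

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class AsyncTokenAwareLoad {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                // Token-aware routing layered on DC-aware round robin.
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
        Session session = cluster.connect("myks"); // keyspace name is a placeholder
        PreparedStatement ps = session.prepare(
                "INSERT INTO tag (tagid, idx, value) VALUES (?, ?, ?)");
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (int i = 0; i < 100000; i++) {
            futures.add(session.executeAsync(ps.bind(i % 50, new java.util.Date(), (double) i)));
            if (futures.size() == 1000) { // collect the futures every 1,000 writes
                for (ResultSetFuture f : futures) {
                    f.getUninterruptibly();
                }
                futures.clear();
            }
        }
        for (ResultSetFuture f : futures) {
            f.getUninterruptibly();
        }
        cluster.close();
    }
}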

--
Colin
320-221-9531


> On May 25, 2014, at 1:56 PM, "Jack Krupansky"  wrote:
> 
> Typo: I presume “channelid” should be “tagid” for the partition key for your 
> table.
>  
> Yes, BATCH statements are the way to go, but be careful not to make your 
> batches too large, otherwise you could lose performance when Cassandra is 
> relatively idle while the batch is slowly streaming in to the coordinator 
> node over the network. Better to break up a large batch into multiple 
> moderate size batches (exact size and number will vary and need testing to 
> deduce) that will transmit quicker and can be executed in parallel.
>  
> I’m not sure Cassandra on a laptop would be the best measure of performance 
> for a real cluster, especially compared to a server with more CPU cores than 
> your laptop.
>  
> And for a real cluster, rows with different partition keys can be sent to a 
> coordinator node that owns that partition key, which could be multiple nodes 
> for RF>1.
>  
> -- Jack Krupansky
>  
> From: Mark Farnan
> Sent: Sunday, May 25, 2014 9:36 AM
> To: user@cassandra.apache.org
> Subject: Possible to Add multiple columns in one query ?
>  
> I’m sure this is a  CQL 101 question, but.  
>  
> Is it possible to add MULTIPLE   Rows/Columns  to a single Partition in a 
> single CQL 3  Query / Call. 
>  
> Need:
> I’m trying to find the most efficient way to add multiple time series events 
> to a table in a single call.
> Whilst most time series data comes in sequentially, we have a case where it 
> is often loaded in bulk,  say sent  100,000 points for 50  channels/tags  at 
> one go.  (sometimes more), and this needs to be loaded as quickly and 
> efficiently as possible.
>  
> Fairly standard Time-Series schema (this is for testing purposes only at this 
> point, and doesn’t represent final schemas)
>  
> CREATE TABLE tag (
>   tagid int,
>   idx timestamp,
>   value double,
>   PRIMARY KEY (channelid, idx)
> ) WITH CLUSTERING ORDER BY (idx DESC);
>  
>  
> Currently I’m using Batch statements, but even that is not fast enough.
>  
> Note: At this point I’m testing on a single node cluster on laptop, to 
> compare different versions.
>  
> We are using DataStax C# 2.0 (beta) client. And Cassandra 2.0.7
>  
> Regards
> Mark.


Re: Possible to Add multiple columns in one query ?

2014-05-25 Thread Jack Krupansky
Typo: I presume “channelid” should be “tagid” for the partition key for your 
table.

Yes, BATCH statements are the way to go, but be careful not to make your 
batches too large, otherwise you could lose performance when Cassandra is 
relatively idle while the batch is slowly streaming in to the coordinator node 
over the network. Better to break up a large batch into multiple moderate size 
batches (exact size and number will vary and need testing to deduce) that will 
transmit quicker and can be executed in parallel.
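
In Java-driver terms (the C# driver is analogous), the shape would be roughly the
sketch below; batch size and parallelism need testing, the keyspace name is a
placeholder, and I'm assuming the channelid/tagid typo is fixed:

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class ModerateBatches {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build()
                                 .connect("myks"); // keyspace name is a placeholder
        PreparedStatement ps = session.prepare(
                "INSERT INTO tag (tagid, idx, value) VALUES (?, ?, ?)");
        List<ResultSetFuture> inFlight = new ArrayList<ResultSetFuture>();
        BatchStatement batch = new BatchStatement();
        int rowsInBatch = 0;
        for (int i = 0; i < 100000; i++) {
            batch.add(ps.bind(i % 50, new java.util.Date(), (double) i));
            if (++rowsInBatch == 100) { // "moderate" batch size -- tune by testing
                inFlight.add(session.executeAsync(batch));
                batch = new BatchStatement();
                rowsInBatch = 0;
            }
        }
        if (rowsInBatch > 0) {
            inFlight.add(session.executeAsync(batch));
        }
        for (ResultSetFuture f : inFlight) {
            f.getUninterruptibly();
        }
    }
}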

I’m not sure Cassandra on a laptop would be the best measure of performance for 
a real cluster, especially compared to a server with more CPU cores than your 
laptop.

And for a real cluster, rows with different partition keys can be sent to a 
coordinator node that owns that partition key, which could be multiple nodes 
for RF>1.

-- Jack Krupansky

From: Mark Farnan 
Sent: Sunday, May 25, 2014 9:36 AM
To: user@cassandra.apache.org 
Subject: Possible to Add multiple columns in one query ?

I’m sure this is a CQL 101 question, but.

Is it possible to add MULTIPLE Rows/Columns to a single Partition in a
single CQL 3 Query / Call.

Need:
I’m trying to find the most efficient way to add multiple time series events to
a table in a single call.
Whilst most time series data comes in sequentially, we have a case where it is
often loaded in bulk, say sent 100,000 points for 50 channels/tags at one
go (sometimes more), and this needs to be loaded as quickly and efficiently
as possible.

Fairly standard Time-Series schema (this is for testing purposes only at this
point, and doesn’t represent final schemas)

CREATE TABLE tag (
  tagid int,
  idx timestamp,
  value double,
  PRIMARY KEY (channelid, idx)
) WITH CLUSTERING ORDER BY (idx DESC);

Currently I’m using Batch statements, but even that is not fast enough.

Note: At this point I’m testing on a single node cluster on laptop, to compare
different versions.

We are using DataStax C# 2.0 (beta) client. And Cassandra 2.0.7

Regards
Mark.


Possible to Add multiple columns in one query ?

2014-05-25 Thread Mark Farnan
I'm sure this is a CQL 101 question, but.

Is it possible to add MULTIPLE Rows/Columns to a single Partition in a
single CQL 3 Query / Call.

Need:
I'm trying to find the most efficient way to add multiple time series events
to a table in a single call.
Whilst most time series data comes in sequentially, we have a case where it
is often loaded in bulk, say sent 100,000 points for 50 channels/tags at
one go (sometimes more), and this needs to be loaded as quickly and
efficiently as possible.

Fairly standard Time-Series schema (this is for testing purposes only at
this point, and doesn't represent final schemas)

CREATE TABLE tag (
  tagid int,
  idx timestamp,
  value double,
  PRIMARY KEY (channelid, idx)
) WITH CLUSTERING ORDER BY (idx DESC);

Currently I'm using Batch statements, but even that is not fast enough.

Note: At this point I'm testing on a single node cluster on laptop, to
compare different versions.

We are using DataStax C# 2.0 (beta) client. And Cassandra 2.0.7

Regards
Mark.



Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-19 Thread Bryan Talbot
I think there are several issues in your schema and queries.

First, the schema can't efficiently return the single newest post for every
author. It can efficiently return the newest N posts for a particular
author.

On Fri, May 16, 2014 at 11:53 PM, 後藤 泰陽  wrote:

>
> But I consider LIMIT to be a keyword to limits result numbers from WHOLE
> results retrieved by the SELECT statement.
>


This is happening due to the incorrect use of the minTimeuuid() function. All
of your created_at values are equal so you're essentially getting 2 (order
not defined) values that have the lowest created_at value.

The minTimeuuid() function is meant to be used in the WHERE clause of a
SELECT statement, often with maxTimeuuid(), to do BETWEEN-style queries on
timeuuid values.
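
For example, a time-window query on created_at would look something like this
(sketch, using the blog_test keyspace from your session):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TimeWindowQuery {
    public static void main(String[] args) {
        Session session = Cluster.builder().addContactPoint("127.0.0.1").build()
                                 .connect("blog_test");
        // John's posts created between Feb 1 and Mar 1 2013, bounded by timeuuids.
        session.execute("SELECT * FROM posts WHERE author = 'john'"
                + " AND created_at > maxTimeuuid('2013-02-01 00:00+0000')"
                + " AND created_at < minTimeuuid('2013-03-01 00:00+0000')");
    }
}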




> The result with SELECT .. LIMIT is below. Unfortunately, this is not what I
> wanted.
> I want the latest posts of each author. (Now I doubt whether CQL3 can represent
> it)
>
> cqlsh:blog_test> create table posts(
>  ... author ascii,
>  ... created_at timeuuid,
>  ... entry text,
>  ... primary key(author,created_at)
>  ... )WITH CLUSTERING ORDER BY (created_at DESC);
> cqlsh:blog_test>
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> mike');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> mike');
> cqlsh:blog_test> select * from posts limit 2;
>
>  author | created_at   | entry
>
> +--+--
>mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
> mike
>mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
> mike
>
>
>
>


To get most recent posts by a particular author, you'll need statements
more like this:

cqlsh:test> insert into posts(author,created_at,entry) values ('john',now(),'This is an old entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values ('john',now(),'This is a new entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values ('mike',now(),'This is an old entry by mike');
cqlsh:test> insert into posts(author,created_at,entry) values ('mike',now(),'This is a new entry by mike');

and then you can get posts by 'john' ordered by newest to oldest as:

cqlsh:test> select author, created_at, dateOf(created_at), entry from posts
where author = 'john' limit 2 ;

 author | created_at                           | dateOf(created_at)       | entry
--------+--------------------------------------+--------------------------+------------------------------
   john | 7cb1ac30-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:36-0700 | This is a new entry by john
   john | 74bb6750-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:23-0700 | This is an old entry by john


-Bryan


Re: Query returns incomplete result

2014-05-19 Thread Aaron Morton
Calling execute the second time runs the query a second time, and it looks like 
the query mutates instance state during the pagination. 

What happens if you only call execute() once ? 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/05/2014, at 8:03 pm, Lu, Boying  wrote:

> Hi, All,
>  
> I use the astyanax 1.56.48 + Cassandra 2.0.6 in my test codes and do some 
> query like this:
>  
> query = keyspace.prepareQuery(..).getKey(…)
> .autoPaginate(true)
> .withColumnRange(new RangeBuilder().setLimit(pageSize).build());
>  
> ColumnList result;
> result= query.execute().getResult();
> while (!result.isEmpty()) {
> //handle result here
> result= query.execute().getResult();
> }
>  
> There are 2003 records in the DB, if the pageSize is set to 1100, I get only 
> 2002 records back.
> and if the pageSize is set to 3000, I can get the all 2003 records back.
>  
> Does anyone know why? Is it a bug?
>  
> Thanks
>  
> Boying



Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread Matope Ono
Hmm. Something like a user-managed index looks like the only way to do what I
want to do.
Thank you, I'll try that.


2014-05-17 18:07 GMT+09:00 DuyHai Doan :

> Clearly with your current data model, having X latest post for each author
> is not possible.
>
>  However, what's about this ?
>
> CREATE TABLE latest_posts_per_user (
>author ascii
>latest_post map,
>PRIMARY KEY (author)
> )
>
>  The latest_post will keep a collection of X latest posts for each user.
> Now the challenge is to "update" this latest_post map every time an user
> create a new post. This can be done in a single CQL3 statement: UPDATE
> latest_posts_per_user SET latest_post = latest_post + {new_uuid: 'new
> entry', oldest_uuid: null} WHERE author = xxx;
>
>  You'll need to know the uuid of the oldest post to remove it from the map
>
>
>
> On Sat, May 17, 2014 at 8:53 AM, 後藤 泰陽  wrote:
>
>> Hello,
>>
>> Thank you for your addressing.
>>
>> But I consider LIMIT to be a keyword to limits result numbers from WHOLE
>> results retrieved by the SELECT statement.
>> The result with SELECT.. LIMIT is below. Unfortunately, This is not what
>> I wanted.
>> I wante latest posts of each authors. (Now I doubt if CQL3 can't
>> represent it)
>>
>> cqlsh:blog_test> create table posts(
>>  ... author ascii,
>>  ... created_at timeuuid,
>>  ... entry text,
>>  ... primary key(author,created_at)
>>  ... )WITH CLUSTERING ORDER BY (created_at DESC);
>> cqlsh:blog_test>
>> cqlsh:blog_test> insert into posts(author,created_at,entry) values
>> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>> john');
>> cqlsh:blog_test> insert into posts(author,created_at,entry) values
>> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
>> john');
>> cqlsh:blog_test> insert into posts(author,created_at,entry) values
>> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>> mike');
>> cqlsh:blog_test> insert into posts(author,created_at,entry) values
>> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
>> mike');
>> cqlsh:blog_test> select * from posts limit 2;
>>
>>  author | created_at   | entry
>>
>> +--+--
>>mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
>> mike
>>mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
>> mike
>>
>>
>>
>>
>> 2014/05/16 23:54、Jonathan Lacefield  のメール:
>>
>> Hello,
>>
>>  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
>> CQL3?
>>
>>  These may help you achieve your goals.
>>
>>
>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html
>>
>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>> (404) 822 3487
>> <http://www.linkedin.com/in/jlacefield>
>>
>> <http://www.datastax.com/cassandrasummit14>
>>
>>
>>
>> On Fri, May 16, 2014 at 12:23 AM, Matope Ono wrote:
>>
>>> Hi, I'm modeling some queries in CQL3.
>>>
>>> I'd like to query first 1 columns for each partitioning keys in CQL3.
>>>
>>> For example:
>>>
>>> create table posts(
>>>> author ascii,
>>>> created_at timeuuid,
>>>> entry text,
>>>> primary key(author,created_at)
>>>> );
>>>> insert into posts(author,created_at,entry) values
>>>> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>>>> john');
>>>> insert into posts(author,created_at,entry) values
>>>> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by 
>>>> john');
>>>> insert into posts(author,created_at,entry) values
>>>> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>>>> mike');
>>>> insert into posts(author,created_at,entry) values
>>>> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by 
>>>> mike');
>>>
>>>
>>> And I want results like below.
>>>
>>> mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
>>>> john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john
>>>
>>>
>>> I think that this is what "SELECT FIRST " statements did in CQL2.
>>>
>>> The only way I came across in CQL3 is "retrieve whole records and drop
>>> manually",
>>> but it's obviously not efficient.
>>>
>>> Could you please tell me more straightforward way in CQL3?
>>>
>>
>>
>>
>


Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread DuyHai Doan
Clearly, with your current data model, having the X latest posts for each author
is not possible.

 However, what about this?

CREATE TABLE latest_posts_per_user (
   author ascii
   latest_post map,
   PRIMARY KEY (author)
)

 The latest_post will keep a collection of the X latest posts for each user.
Now the challenge is to "update" this latest_post map every time a user
creates a new post. This can be done in a single CQL3 statement: UPDATE
latest_posts_per_user SET latest_post = latest_post + {new_uuid: 'new
entry', oldest_uuid: null} WHERE author = xxx;

 You'll need to know the uuid of the oldest post to remove it from the map
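
For illustration (the map's key and value types and the literal uuids here are
assumptions, not part of the original reply), a minimal sketch of that approach:

CREATE TABLE latest_posts_per_user (
    author ascii,
    latest_post map<timeuuid, text>,
    PRIMARY KEY (author)
);

-- add the newest post to the map (timeuuid literal reused from the thread for illustration)
UPDATE latest_posts_per_user
   SET latest_post = latest_post + {1c4d9000-83e9-11e2-8080-808080808080: 'This is a new entry by mike'}
 WHERE author = 'mike';

-- evict the oldest entry once the map has grown past X posts
DELETE latest_post[4e52d000-6d1f-11e2-8080-808080808080]
  FROM latest_posts_per_user
 WHERE author = 'mike';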



On Sat, May 17, 2014 at 8:53 AM, 後藤 泰陽  wrote:

> Hello,
>
> Thank you for your addressing.
>
> But I consider LIMIT to be a keyword to limits result numbers from WHOLE
> results retrieved by the SELECT statement.
> The result with SELECT.. LIMIT is below. Unfortunately, This is not what I
> wanted.
> I wante latest posts of each authors. (Now I doubt if CQL3 can't represent
> it)
>
> cqlsh:blog_test> create table posts(
>  ... author ascii,
>  ... created_at timeuuid,
>  ... entry text,
>  ... primary key(author,created_at)
>  ... )WITH CLUSTERING ORDER BY (created_at DESC);
> cqlsh:blog_test>
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> mike');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> mike');
> cqlsh:blog_test> select * from posts limit 2;
>
>  author | created_at   | entry
>
> +--+--
>mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
> mike
>mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
> mike
>
>
>
>
> 2014/05/16 23:54、Jonathan Lacefield  のメール:
>
> Hello,
>
>  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
> CQL3?
>
>  These may help you achieve your goals.
>
>
> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html
>
> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Fri, May 16, 2014 at 12:23 AM, Matope Ono  wrote:
>
>> Hi, I'm modeling some queries in CQL3.
>>
>> I'd like to query first 1 columns for each partitioning keys in CQL3.
>>
>> For example:
>>
>> create table posts(
>>> author ascii,
>>> created_at timeuuid,
>>> entry text,
>>> primary key(author,created_at)
>>> );
>>> insert into posts(author,created_at,entry) values
>>> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>>> john');
>>> insert into posts(author,created_at,entry) values
>>> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
>>> insert into posts(author,created_at,entry) values
>>> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>>> mike');
>>> insert into posts(author,created_at,entry) values
>>> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
>>
>>
>> And I want results like below.
>>
>> mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
>>> john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john
>>
>>
>> I think that this is what "SELECT FIRST " statements did in CQL2.
>>
>> The only way I came across in CQL3 is "retrieve whole records and drop
>> manually",
>> but it's obviously not efficient.
>>
>> Could you please tell me more straightforward way in CQL3?
>>
>
>
>


Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread 後藤 泰陽
Hello,

Thank you for your addressing.

But I consider LIMIT to be a keyword that limits the number of results from the WHOLE 
result set retrieved by the SELECT statement.
The result with SELECT .. LIMIT is below. Unfortunately, this is not what I 
wanted.
I want the latest posts of each author. (Now I doubt whether CQL3 can represent it.)

> cqlsh:blog_test> create table posts(
>  ... author ascii,
>  ... created_at timeuuid,
>  ... entry text,
>  ... primary key(author,created_at)
>  ... )WITH CLUSTERING ORDER BY (created_at DESC);
> cqlsh:blog_test> 
> cqlsh:blog_test> insert into posts(author,created_at,entry) values 
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values 
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values 
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by mike');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values 
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
> cqlsh:blog_test> select * from posts limit 2;
> 
>  author | created_at   | entry
> +--+--
>mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by mike
>mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by mike



2014/05/16 23:54、Jonathan Lacefield  のメール:

> Hello,
> 
>  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of CQL3?
> 
>  These may help you achieve your goals.
> 
>   
> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html
>   
> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
> 
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> 
> 
> 
> 
> 
> 
> On Fri, May 16, 2014 at 12:23 AM, Matope Ono  wrote:
> Hi, I'm modeling some queries in CQL3.
> 
> I'd like to query first 1 columns for each partitioning keys in CQL3.
> 
> For example:
> 
> create table posts(
>   author ascii,
>   created_at timeuuid,
>   entry text,
>   primary key(author,created_at)
> );
> insert into posts(author,created_at,entry) values 
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by john');
> insert into posts(author,created_at,entry) values 
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
> insert into posts(author,created_at,entry) values 
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by mike');
> insert into posts(author,created_at,entry) values 
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
> 
> And I want results like below.
> 
> mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
> john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john
> 
> I think that this is what "SELECT FIRST " statements did in CQL2.
> 
> The only way I came across in CQL3 is "retrieve whole records and drop 
> manually",
> but it's obviously not efficient.
> 
> Could you please tell me more straightforward way in CQL3?
> 



Query first 1 columns for each partitioning keys in CQL?

2014-05-16 Thread Matope Ono
Hi, I'm modeling some queries in CQL3.

I'd like to query the first 1 column for each partitioning key in CQL3.

For example:

create table posts(
> author ascii,
> created_at timeuuid,
> entry text,
> primary key(author,created_at)
> );
> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> john');
> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> mike');
> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');


And I want results like below.

mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
> john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john


I think that this is what "SELECT FIRST " statements did in CQL2.

The only way I came across in CQL3 is "retrieve whole records and drop
manually",
but it's obviously not efficient.

Could you please tell me a more straightforward way in CQL3?


Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-16 Thread Jonathan Lacefield
Hello,

 Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
CQL3?

 These may help you achieve your goals.


http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html

http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>



On Fri, May 16, 2014 at 12:23 AM, Matope Ono  wrote:

> Hi, I'm modeling some queries in CQL3.
>
> I'd like to query first 1 columns for each partitioning keys in CQL3.
>
> For example:
>
> create table posts(
>> author ascii,
>> created_at timeuuid,
>> entry text,
>> primary key(author,created_at)
>> );
>> insert into posts(author,created_at,entry) values
>> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>> john');
>> insert into posts(author,created_at,entry) values
>> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
>> insert into posts(author,created_at,entry) values
>> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
>> mike');
>> insert into posts(author,created_at,entry) values
>> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
>
>
> And I want results like below.
>
> mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
>> john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john
>
>
> I think that this is what "SELECT FIRST " statements did in CQL2.
>
> The only way I came across in CQL3 is "retrieve whole records and drop
> manually",
> but it's obviously not efficient.
>
> Could you please tell me more straightforward way in CQL3?
>


Query returns incomplete result

2014-05-15 Thread Lu, Boying
Hi, All,

I use the astyanax 1.56.48 + Cassandra 2.0.6 in my test codes and do some query 
like this:

query = keyspace.prepareQuery(..).getKey(...)
.autoPaginate(true)
.withColumnRange(new RangeBuilder().setLimit(pageSize).build());

ColumnList result;
result= query.execute().getResult();
while (!result.isEmpty()) {
//handle result here
result= query.execute().getResult();
}

There are 2003 records in the DB, if the pageSize is set to 1100, I get only 
2002 records back.
and if the pageSize is set to 3000, I can get the all 2003 records back.

Does anyone know why? Is it a bug?

Thanks

Boying



Node never know a table has been DROP or CREATE if its gossip is disabled while executing this query

2014-04-26 Thread Zhe Yang
Hi all,

I'm using Cassandra 2.0.6 and I have 8 nodes. I'm doing some tests by using
operations below:

disable gossip of node A;
check the status by nodetool in other node, node A is Down now;
use cqlsh connecting an "Up" node and create a table;
enable gossip of node A;
check the status, all nodes are "Up" now.

Then I find that node A doesn't know this table has been created. Both its
own cql shell and nodetool cfstats tell me the table doesn't exist. Even
after waiting for a few minutes to reach the "eventual consistency" final status,
node A still doesn't know about this table. And I find that if every node knows a
table exists but I drop it while one node's gossip is disabled, that node will
never know the table has been dropped.

Is this a bug?

-- 
Regards,
Zhe Yang


cqlsh very strange query results behaviour (Cassandra 2.0.6)

2014-04-20 Thread Jacob Rhoden
This just happened, is this fixed in 2.0.7?

cqlsh:tap> select * from setting;
Bad Request: unconfigured columnfamily settings
cqlsh:tap> select * from settings;

 name | value
--+--
  ldap.userdn |  uid={1},ou=People,dc=example,dc=com
   throttle.auth.attempts |3
...

(32 rows)

cqlsh:tap> select * from profile_counter limit 10;

 name | value
--+--
 throttle.auth.window |   60
  ldap.userdn |  uid={1},ou=People,dc=example,dc=com
...



Re: select query returns wrong value if use DESC option

2014-03-13 Thread Edward Capriolo
Consider filing a JIRA. CQL is the standard interface to Cassandra;
everything is heavily tested.
On Thursday, March 13, 2014, Katsutoshi Nagaoka 
wrote:
> Hi.
>
> I am using Cassandra 2.0.6 version. There is a case that select query
returns wrong value if use DESC option. My test procedure is as follows:
>
> --
> cqlsh:test> CREATE TABLE mytable (key int, range int, PRIMARY KEY (key,
range));
> cqlsh:test> INSERT INTO mytable (key, range) VALUES (0, 0);
> cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0;
>
>  key | range
> -+---
>0 | 0
>
> (1 rows)
>
> cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0 ORDER BY
range ASC;
>
>  key | range
> -+---
>0 | 0
>
> (1 rows)
>
> cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0 ORDER BY
range DESC;
>
> (0 rows)
> --
>
> Why returns value is 0 rows if using DESC option? I expected the same 1
row as the return value of other queries. Does anyone has a similar issue?
>
> Thanks,
> Katsutoshi

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


select query returns wrong value if use DESC option

2014-03-13 Thread Katsutoshi Nagaoka
Hi.

I am using Cassandra 2.0.6. There is a case where a select query
returns the wrong value if the DESC option is used. My test procedure is as follows:

--
cqlsh:test> CREATE TABLE mytable (key int, range int, PRIMARY KEY (key,
range));
cqlsh:test> INSERT INTO mytable (key, range) VALUES (0, 0);
cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0;

 key | range
-+---
   0 | 0

(1 rows)

cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0 ORDER BY
range ASC;

 key | range
-+---
   0 | 0

(1 rows)

cqlsh:test> SELECT * FROM mytable WHERE key = 0 AND range = 0 ORDER BY
range DESC;

(0 rows)
--

Why is the returned value 0 rows when using the DESC option? I expected the same 1 row
as the return value of the other queries. Does anyone have a similar issue?

Thanks,
Katsutoshi


How to use CompositeRangeBuilder to query multiple components in a column?

2014-03-09 Thread Lu, Boying
Hi, experts,

I need to query all columns of a row in a column family that meet some 
conditions (see below).  The column is composite column and has following 
format:
... where componentN has String type.

What I want to do is to find out all columns that meet following conditions:

1.   component1 == prefix

2.   value1 <= component2 <= value2

3.   value3 <= component3 <= value4


I want to use
  CompositeRangeBuilder.withPrefix(prefix)  // to match component 1
   .greaterThanEquals(value1)
   .lessThenEquals(value2)  // 
to filter out component2
   .nextComponent()
   .greaterThanEquals(value3)
   .lessThenEquals(value4)  // 
to filter out component3

But I can only use greaterThenEquals() and lessThanEquals() once because the 
nextComponent() is not accessible.
Is there any solution to allow me to do such query?

Thanks a lot

Boying



Re: Query on blob col using CQL3

2014-02-28 Thread Peter Lin
why are you trying to view a blob with CQL3? and what kind of blob is it?

if the blob is an object, there's no way to view that in CQL3. You'd need
to do extra work like user defined types, but I don't know of anyone that's
actually using that.


On Fri, Feb 28, 2014 at 12:14 PM, Senthil, Athinanthny X. -ND <
athinanthny.x.senthil@disney.com> wrote:

> Anyone can suggest how to query on blob column via CQL3. I get  bad
> request error saying cannot parse data. I want to lookup on key column
> which is defined as blob.
>
> But I am able to lookup data via opscenter data explorer.  Is there a
> conversion functions I need to use?
>
>
>
>
> Sent from my Galaxy S®III
>


Re: Query on blob col using CQL3

2014-02-28 Thread Mikhail Stepura

Did you try http://cassandra.apache.org/doc/cql3/CQL.html#blobFun ?
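
For illustration (the table and key value here are made up), the conversion
functions on that page can be used directly in the WHERE clause, assuming the
blob key was originally written from text:

-- hypothetical table with a blob key
CREATE TABLE items (key blob PRIMARY KEY, value text);

SELECT blobAsText(key), value
  FROM items
 WHERE key = textAsBlob('some-key');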


On 2/28/14, 9:14, Senthil, Athinanthny X. -ND wrote:

Anyone can suggest how to query on blob column via CQL3. I get  bad
request error saying cannot parse data. I want to lookup on key column
which is defined as blob.

But I am able to lookup data via opscenter data explorer.  Is there a
conversion functions I need to use?




Sent from my Galaxy S®III





Query on blob col using CQL3

2014-02-28 Thread Senthil, Athinanthny X. -ND
Can anyone suggest how to query on a blob column via CQL3? I get a bad request 
error saying it cannot parse the data. I want to look up on the key column, which is 
defined as blob.

But I am able to look up the data via the OpsCenter data explorer. Is there a 
conversion function I need to use?




Sent from my Galaxy S®III


Re: Periodic rpc_timeout errors on select query

2014-02-06 Thread Chap Lovejoy

Hi Steve,

It looks like it will be pretty easy for us to do some testing with the 
new client version. I'm going to give it a shot and keep my fingers 
crossed.


Thanks again,
Chap

On 5 Feb 2014, at 18:10, Steven A Robenalt wrote:


Hi Chap,

If you have the ability to test the 2.0.0rc2 driver, I would recommend
doing so, even from a dedicated test client or a JUnit test case. 
There are
other benefits to the change, such as being able to use 
BatchStatements,

aside from possible impact on your read timeouts.

Steve


Re: Periodic rpc_timeout errors on select query

2014-02-05 Thread Steven A Robenalt
Hi Chap,

If you have the ability to test the 2.0.0rc2 driver, I would recommend
doing so, even from a dedicated test client or a JUnit test case. There are
other benefits to the change, such as being able to use BatchStatements,
aside from possible impact on your read timeouts.

Steve



On Wed, Feb 5, 2014 at 3:06 PM, Chap Lovejoy  wrote:

> Hi Steve,
>
> Thanks for the reply. After all that information in my initial message I
> would forget one of the most important bits. We're running Cassandra 2.0.3
> with the 1.0.4 version of the DataStax driver.  I'd seen mention of those
> timeouts under earlier 2.x versions and really hoped they were the source
> of our problem but unfortunately that doesn't seem to be the case.
>
> Thanks again,
> Chap
>
>
> On 5 Feb 2014, at 17:49, Steven A Robenalt wrote:
>
>  Hi Chap,
>>
>> You don't indicate which version of Cassandra and what client side driver
>> you are using, but I have seen the same behavior with Cassandra 2.0.2 and
>> earlier versions of the Java Driver. With Cassandra 2.0.3 and the 2.0.0rc2
>> driver, my read timeouts are basically nonexistent at my current load
>> levels.
>>
>> Not sure how this applies if you're still on 1.x versions of Cassandra
>> since we moved off of that branch a few months ago. Ditto for the client
>> driver if you're using something other than the Java Driver, or the 1.x
>> version of same. Our problems were due to changes specific to the 2.x
>> versions only.
>>
>> Steve
>>
>


-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu


Re: Periodic rpc_timeout errors on select query

2014-02-05 Thread Chap Lovejoy

Hi Steve,

Thanks for the reply. After all that information in my initial message I 
would forget one of the most important bits. We're running Cassandra 
2.0.3 with the 1.0.4 version of the DataStax driver.  I'd seen mention 
of those timeouts under earlier 2.x versions and really hoped they were 
the source of our problem but unfortunately that doesn't seem to be the 
case.


Thanks again,
Chap

On 5 Feb 2014, at 17:49, Steven A Robenalt wrote:


Hi Chap,

You don't indicate which version of Cassandra and what client side 
driver
you are using, but I have seen the same behavior with Cassandra 2.0.2 
and
earlier versions of the Java Driver. With Cassandra 2.0.3 and the 
2.0.0rc2

driver, my read timeouts are basically nonexistent at my current load
levels.

Not sure how this applies if you're still on 1.x versions of Cassandra
since we moved off of that branch a few months ago. Ditto for the 
client
driver if you're using something other than the Java Driver, or the 
1.x

version of same. Our problems were due to changes specific to the 2.x
versions only.

Steve


Re: Periodic rpc_timeout errors on select query

2014-02-05 Thread Steven A Robenalt
Hi Chap,

You don't indicate which version of Cassandra and what client side driver
you are using, but I have seen the same behavior with Cassandra 2.0.2 and
earlier versions of the Java Driver. With Cassandra 2.0.3 and the 2.0.0rc2
driver, my read timeouts are basically nonexistent at my current load
levels.

Not sure how this applies if you're still on 1.x versions of Cassandra
since we moved off of that branch a few months ago. Ditto for the client
driver if you're using something other than the Java Driver, or the 1.x
version of same. Our problems were due to changes specific to the 2.x
versions only.

Steve



On Wed, Feb 5, 2014 at 2:14 PM, Chap Lovejoy  wrote:

> Hi,
>
> We're seeing pretty regular rpc timeout errors on what appear to be simple
> queries. We're running a three node cluster under pretty light load. We're
> averaging 30-40 writes/sec and about 8 reads/sec according to OpsCenter.
> The failures don't seem to be related to any changes in load. A single
> query repeated from CQLSH (about once a second or so) will fail
> approximately one out of ten times. I do see an increase in the average
> read latency around the time of the failure, though it's unclear if that's
> from the single failed request or if others are affected. This seems to
> happen most on a number of similarly structured tables. One is:
>
> CREATE TABLE psr (
>   inst_id bigint,
>   prosp_id bigint,
>   inter_id bigint,
>   avail text,
>   comments text,
>   email text,
>   first_name text,
>   last_name text,
>   m_id text,
>   m_num text,
>   phone text,
>   info blob,
>   status text,
>   time timestamp,
>   PRIMARY KEY ((inst_id, prosp_id), inter_id)
> ) WITH CLUSTERING ORDER BY (inter_id DESC) AND
>   bloom_filter_fp_chance=0.01 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.00 AND
>   gc_grace_seconds=864000 AND
>   index_interval=128 AND
>   read_repair_chance=0.10 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='99.0PERCENTILE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
>
> I'm executing the query:
> SELECT inter_id FROM "psr" WHERE inst_id = 1 AND prosp_id =
> 127788649174986752 AND inter_id < 30273563814527279 LIMIT 1;
>
> Normally this query returns 32 rows. A total of 413 match the partition
> key.
>
> Here is a trace for a successful run:
>   | timestamp| source
>| source_elapsed
> ------+-
> -+---+
>execute_cql3_query | 21:52:20,831 |
> 10.128.32.141 |  0
>  Message received from /10.128.32.141 | 21:52:20,826 |
> 10.128.32.140 | 69
>   Executing single-partition query on psr | 21:52:20,827 |
> 10.128.32.140 |502
>  Acquiring sstable references | 21:52:20,827 |
> 10.128.32.140 |517
>   Merging memtable tombstones | 21:52:20,827 |
> 10.128.32.140 |576
>  Key cache hit for sstable 54 | 21:52:20,827 |
> 10.128.32.140 |685
> Seeking to partition indexed section in data file | 21:52:20,827 |
> 10.128.32.140 |697
> Skipped 1/2 non-slice-intersecting sstables,  | 21:52:20,827 |
> 10.128.32.140 |751
> included 0 due to tombstones
>Merging data from memtables and 1 sstables | 21:52:20,827 |
> 10.128.32.140 |773
>   Read 32 live and 0 tombstoned cells | 21:52:20,828 |
> 10.128.32.140 |   2055
>  Enqueuing response to /10.128.32.141 | 21:52:20,829 |
> 10.128.32.140 |   2172
> Sending message to /10.128.32.141 | 21:52:20,829 |
> 10.128.32.140 |   2341
>   Parsing SELECT ...  | 21:52:20,831 |
> 10.128.32.141 |105
>   Preparing statement | 21:52:20,831 |
> 10.128.32.141 |200
> Sending message to /10.128.32.140 | 21:52:20,831 |
> 10.128.32.141 |492
>  Message received from /10.128.32.140 | 21:52:20,836 |
> 10.128.32.141 |   5361
>   Processing response from /10.128.32.140 | 21:52:20,836 |
> 10.128.32.141 |   5534
>  

Periodic rpc_timeout errors on select query

2014-02-05 Thread Chap Lovejoy

Hi,

We're seeing pretty regular rpc timeout errors on what appear to be 
simple queries. We're running a three node cluster under pretty light 
load. We're averaging 30-40 writes/sec and about 8 reads/sec according 
to OpsCenter. The failures don't seem to be related to any changes in 
load. A single query repeated from CQLSH (about once a second or so) 
will fail approximately one out of ten times. I do see an increase in 
the average read latency around the time of the failure, though it's 
unclear if that's from the single failed request or if others are 
affected. This seems to happen most on a number of similarly structured 
tables. One is:


CREATE TABLE psr (
  inst_id bigint,
  prosp_id bigint,
  inter_id bigint,
  avail text,
  comments text,
  email text,
  first_name text,
  last_name text,
  m_id text,
  m_num text,
  phone text,
  info blob,
  status text,
  time timestamp,
  PRIMARY KEY ((inst_id, prosp_id), inter_id)
) WITH CLUSTERING ORDER BY (inter_id DESC) AND
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

I'm executing the query:
SELECT inter_id FROM "psr" WHERE inst_id = 1 AND prosp_id = 
127788649174986752 AND inter_id < 30273563814527279 LIMIT 1;


Normally this query returns 32 rows. A total of 413 match the partition 
key.


Here is a trace for a successful run:
  | timestamp| 
source| source_elapsed

--+--+---+
   execute_cql3_query | 21:52:20,831 | 
10.128.32.141 |  0
     Message received from /10.128.32.141 | 21:52:20,826 | 
10.128.32.140 | 69
  Executing single-partition query on psr | 21:52:20,827 | 
10.128.32.140 |502
 Acquiring sstable references | 21:52:20,827 | 
10.128.32.140 |517
  Merging memtable tombstones | 21:52:20,827 | 
10.128.32.140 |576
 Key cache hit for sstable 54 | 21:52:20,827 | 
10.128.32.140 |685
Seeking to partition indexed section in data file | 21:52:20,827 | 
10.128.32.140 |697
Skipped 1/2 non-slice-intersecting sstables,  | 21:52:20,827 | 
10.128.32.140 |751

included 0 due to tombstones
   Merging data from memtables and 1 sstables | 21:52:20,827 | 
10.128.32.140 |773
  Read 32 live and 0 tombstoned cells | 21:52:20,828 | 
10.128.32.140 |   2055
 Enqueuing response to /10.128.32.141 | 21:52:20,829 | 
10.128.32.140 |   2172
Sending message to /10.128.32.141 | 21:52:20,829 | 
10.128.32.140 |   2341
  Parsing SELECT ...  | 21:52:20,831 | 
10.128.32.141 |105
  Preparing statement | 21:52:20,831 | 
10.128.32.141 |200
Sending message to /10.128.32.140 | 21:52:20,831 | 
10.128.32.141 |492
 Message received from /10.128.32.140 | 21:52:20,836 | 
10.128.32.141 |   5361
  Processing response from /10.128.32.140 | 21:52:20,836 | 
10.128.32.141 |   5534
 Request complete | 21:52:20,837 | 
10.128.32.141 |   6013



And here is on unsuccessful run:

   | timestamp| 
source| source_elapsed

---+--+---+---
execute_cql3_query | 21:56:19,792 | 
10.128.32.141 |  0
   Parsing SELECT ...  | 21:56:19,792 | 
10.128.32.141 | 69
   Preparing statement | 21:56:19,792 | 
10.128.32.141 |160
 Sending message to /10.128.32.137 | 21:56:19,792 | 
10.128.32.141 |509
  Message received from /10.128.32.141 | 21:56:19,793 | 
10.128.32.137 | 57
   Executing single-partition query on psr | 21:56:19,794 | 
10.128.32.137 |412
  Acquiring sstable references | 21:56:19,794 | 
10.128.32.137 |444
   Merging memtable tombstones | 21:56:19,794 | 
10.128.32.137 |486
  Key c

Re: Query on Seed node

2014-02-03 Thread Or Sher
I'm guessing it's just a coincidence. As far as I know, seeds have nothing
to do with where the data should be located.
I think there could be a couple of reasons why you wouldn't see SSTables in a
specific column family folder; these are some of them:
- You're using a few distinct keys, none of which should be placed on that seed
node.
- The node hasn't flushed yet. You can use nodetool flush to try and flush
memtables manually (see the example just after this list).
- You're using manual token assignment and you did not assign the tokens well.
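
For illustration, a manual flush of one column family on the seed node would look
something like this (the host, keyspace and column family names are placeholders):

nodetool -h <seed_node_address> flush <keyspace> <column_family>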


On Mon, Feb 3, 2014 at 1:25 PM, Aravindan T  wrote:

> Hi ,
>
> I have a  4 node cassandra cluster with one node marked as seed node. When
> i checked the data directory of seed node , it has two folders
> /keyspace/columnfamily.
> But sstable db files are not available.the folder is empty.The db files
> are available in remaining nodes.
>
> I want to know the reason why db files are not created in seed node ?
> what will happen if all the nodes in a cluster is marked as seed node ??
>
>
>
> Aravindan Thangavelu
> Tata Consultancy Services
> Mailto: aravinda...@tcs.com
> Website: http://www.tcs.com
> 
> Experience certainty. IT Services
> Business Solutions
> Consulting
> 
>
> =-=-=
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
>
>


-- 
Or Sher


Query on Seed node

2014-02-03 Thread Aravindan T
 Hi ,

I have a 4 node cassandra cluster with one node marked as seed node. When I 
checked the data directory of the seed node, it has two folders: 
/keyspace/columnfamily.
But the sstable db files are not available; the folder is empty. The db files are 
available in the remaining nodes.

I want to know the reason why db files are not created in the seed node.
What will happen if all the nodes in a cluster are marked as seed nodes?



Aravindan Thangavelu
Tata Consultancy Services
Mailto: aravinda...@tcs.com
Website: http://www.tcs.com

Experience certainty.   IT Services
Business Solutions
Consulting

=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you




Getting indexoutbound exception for a specific query on cassandra trunk

2014-01-16 Thread Naresh Yadav
I took the latest source code of the Cassandra trunk to evaluate the performance
of the new indexing-on-collections feature (
https://issues.apache.org/jira/browse/CASSANDRA-4511) for my use case.

If you configure the table like this, with the commands in the given order:

CREATE TABLE testcollectionindex(userid text, timeunitid text, periodid
text, periodlabel text, periodtags text, unit text, datatags text,
datatagsset set<text>, value double, PRIMARY KEY((unit,periodid), datatags));

INSERT INTO testcollectionindex(periodlabel, datatags, datatagsset,
value,timeunitid,periodid, unit) VALUES('Feb-2010', 'India|Pen|Store1',
{'India', 'Pen', 'Store1'}, 10,'Month','Period2','Number');

CREATE INDEX testcollectionindexdatatagsset ON testcollectionindex
(datatagsset);

SELECT * FROM testcollectionindex WHERE datatagsset CONTAINS 'Store1';
Output ( works perfectly ):
 unit   | periodid | datatags         | datatagsset                | periodlabel
--------+----------+------------------+----------------------------+-------------
 Number |  Period2 | India|Pen|Store1 | {'India', 'Pen', 'Store1'} | Feb-2010

(1 rows)

*SELECT * FROM testcollectionindex WHERE periodid='Period2' AND
unit='Number' AND datatagsset CONTAINS 'Store1';*

THIS QUERY DOES NOT WORK. I get an RPC timeout error and the server logs show an
IndexOutOfBoundsException (http://pastebin.com/f7qmRc0R)

Debugging the code for this query, I get SliceQueryFilter [reversed=false,
slices=[[, ]], count=2147483647, toGroup = 1]; because of that it throws
java.lang.ArrayIndexOutOfBoundsException: 0 in
CompositesIndexOnCollectionKey.java, method makeIndexColumnNameBuilder()

Note: I also tested this query on the 03-Dec-2013 source code snapshot of
Cassandra and got the same exception there as well.

Could someone please help me with this so that I can proceed and conclude
on this newly supported feature of Cassandra?

Thanks
Naresh


Re: Problem inserting set when query contains IF NOT EXISTS.

2014-01-13 Thread Vladimir Prudnikov
Sorry, I thought I was running the latest version, but it was on this
instance...

[cqlsh 4.1.0 | Cassandra 2.0.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol
19.37.0]

I tried with 2.0.4 and it works just fine.


On Mon, Jan 13, 2014 at 9:07 PM, Huang, Roger  wrote:

>  Validimir,
>
> Thanks what version of Cassandra?
>
> -Roger
>
>
>
>
>
> *From:* Vladimir Prudnikov [mailto:v.prudni...@gmail.com]
> *Sent:* Monday, January 13, 2014 11:57 AM
> *To:* user
> *Subject:* Problem inserting set when query contains IF NOT EXISTS.
>
>
>
> Hi all,
>
> I've spend a lot of time finding a bug in system, but it turns out that
> the problem is in Cassandra.
>
>
>
> Here is how to reproduce.
>
>
>
> =
>
> CREATE KEYSPACE IF NOT EXISTS test_set WITH REPLICATION = { 'class' :
> 'SimpleStrategy', 'replication_factor' : 1 };
>
> USE test_set;
>
>
>
> CREATE TABLE IF NOT EXISTS user (
>
> key timeuuid PRIMARY KEY,
>
> username text,
>
> email text,
>
> first_name text,
>
> last_name text,
>
> features set,
>
> ) WITH caching='all';
>
>
>
> INSERT INTO user(key,username,email,first_name,last_name,features) VALUES
> (now(),'ainsttp0ess2kiphu2pe1bbrle','
> l3b7brn6jp9e8s0mmsr7ae5...@mmcm4jf9a9g9b95c053ksbsi18.com','gqh9ekmv6vc9nf1ce8eo3rjcdd','fmg92158br9ddivoj59417q514',{'i6v8i4a5gpnris5chjibllqf0','480m4c1obiq61ilii1g7rm0v17','50kovlifrtrtqihnvmbefaeacl'})
> IF NOT EXISTS;
>
>
>
> select * from user;
>
> ==
>
>
>
> The problem here is that user.features is null instead of set of 3 strings.
>
> If you remove `IF NOT EXISTS` it executes correctly and set of string will
> be inserted.
>
>
>
> I don't see any problem with the queries, seems to be the problem with C*.
>
>
>
> --
> Vladimir Prudnikov
>



-- 
Vladimir Prudnikov


RE: Problem inserting set when query contains IF NOT EXISTS.

2014-01-13 Thread Huang, Roger
Vladimir,
Thanks. What version of Cassandra?
-Roger


From: Vladimir Prudnikov [mailto:v.prudni...@gmail.com]
Sent: Monday, January 13, 2014 11:57 AM
To: user
Subject: Problem inserting set when query contains IF NOT EXISTS.

Hi all,
I've spend a lot of time finding a bug in system, but it turns out that the 
problem is in Cassandra.

Here is how to reproduce.

=
CREATE KEYSPACE IF NOT EXISTS test_set WITH REPLICATION = { 'class' : 
'SimpleStrategy', 'replication_factor' : 1 };
USE test_set;

CREATE TABLE IF NOT EXISTS user (
key timeuuid PRIMARY KEY,
username text,
email text,
first_name text,
last_name text,
features set,
) WITH caching='all';

INSERT INTO user(key,username,email,first_name,last_name,features) VALUES 
(now(),'ainsttp0ess2kiphu2pe1bbrle','l3b7brn6jp9e8s0mmsr7ae5...@mmcm4jf9a9g9b95c053ksbsi18.com<mailto:l3b7brn6jp9e8s0mmsr7ae5...@mmcm4jf9a9g9b95c053ksbsi18.com>','gqh9ekmv6vc9nf1ce8eo3rjcdd','fmg92158br9ddivoj59417q514',{'i6v8i4a5gpnris5chjibllqf0','480m4c1obiq61ilii1g7rm0v17','50kovlifrtrtqihnvmbefaeacl'})
 IF NOT EXISTS;

select * from user;
==

The problem here is that user.features is null instead of set of 3 strings.
If you remove `IF NOT EXISTS` it executes correctly and set of string will be 
inserted.

I don't see any problem with the queries, seems to be the problem with C*.

--
Vladimir Prudnikov


Problem inserting set when query contains IF NOT EXISTS.

2014-01-13 Thread Vladimir Prudnikov
Hi all,
I've spent a lot of time finding a bug in the system, but it turns out that the
problem is in Cassandra.

Here is how to reproduce.

=
CREATE KEYSPACE IF NOT EXISTS test_set WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor' : 1 };
USE test_set;

CREATE TABLE IF NOT EXISTS user (
 key timeuuid PRIMARY KEY,
username text,
email text,
 first_name text,
last_name text,
features set<text>,
) WITH caching='all';

INSERT INTO user(key,username,email,first_name,last_name,features) VALUES
(now(),'ainsttp0ess2kiphu2pe1bbrle','
l3b7brn6jp9e8s0mmsr7ae5...@mmcm4jf9a9g9b95c053ksbsi18.com','gqh9ekmv6vc9nf1ce8eo3rjcdd','fmg92158br9ddivoj59417q514',{'i6v8i4a5gpnris5chjibllqf0','480m4c1obiq61ilii1g7rm0v17','50kovlifrtrtqihnvmbefaeacl'})
IF NOT EXISTS;

select * from user;
==

The problem here is that user.features is null instead of a set of 3 strings.
If you remove `IF NOT EXISTS` it executes correctly and the set of strings will
be inserted.

I don't see any problem with the queries, seems to be the problem with C*.

-- 
Vladimir Prudnikov


Re: Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-22 Thread Edward Capriolo
You CAN supply only some of the components for a slice.
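
For illustration (a CQL3 analogue, not from the original reply), restricting only a
prefix of the composite is allowed; the table and values below are made up:

CREATE TABLE wide_row (
    row_key text,
    component1 text,
    component2 int,
    value text,
    PRIMARY KEY (row_key, component1, component2)
);

-- supplies only the first component: returns username:0, username:1, ...
SELECT * FROM wide_row WHERE row_key = 'key1' AND component1 = 'username';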


On Fri, Dec 20, 2013 at 2:13 PM, Josh Dzielak  wrote:

>  Is there a way to include *multiple* column names in a slice query where
> one only component of the composite column name key needs to match?
>
> For example, if this was a single row -
>
> username:0   |   username:1   |  city:0   |  city:1 |   other:0|
> other:1
>
> ---
> bob  |   ted  |  sf   |  nyc|   foo|
> bar
>
> I can do a slice with "username:0" and "city:1" or any fully identified
> column names. I also can do a range query w/ first component equal to
> "username", and set the bounds for the second component of the key to +/-
> infinity (or \u0 to \u for utf8), and get all columns back that
> start with "username".
>
> But what if I want to get all usernames and all cities? Without composite
> keys this would be easy - just slice on a collection of column names -
> ["username", "city"]. With composite column names it would have to look
> something like ["username:*", "city:*"], where * represents a wildcard or a
> range.
>
> My questions –
>
> 1) Is this supported in the Thrift interface or CQL?
> 2) If not, is there clever data modeling or indexing that could accomplish
> this use case? 1 single-row round-trip to get these columns?
> 3) Is there plans to support this in the future? Generally, what is the
> future of composite columns in a CQL world?
>
> Thanks!
> Josh
>


Re: Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-20 Thread Josh Dzielak
Thanks Nate.  

I will take a look at extending thrift, seems like this could be useful for 
some folks.  


On Friday, December 20, 2013 at 12:29 PM, Nate McCall wrote:

> >  
> > My questions –
> >  
> > 1) Is this supported in the Thrift interface or CQL?
>  
> Not directly, no.  
>   
> > 2) If not, is there clever data modeling or indexing that could accomplish 
> > this use case? 1 single-row round-trip to get these columns?
> >  
>  
>  
> If this is a query done frequently you could prefix both columns with a 
> static value, eg. ["foo:username", foo:city...", "bar:other_column:..."] 
> so in this specific case you look for 'foo:*'  
>   
> > 3) Is there plans to support this in the future? Generally, what is the 
> > future of composite columns in a CQL world?
> >  
>  
> You can always extend cassandra.thrift and add a custom method (not as hard 
> as it sounds - Thrift is designed for this). Side note: DataStax Enterprise 
> works this way for reading the CassandraFileSystem blocks. An early 
> prototype:  
> https://github.com/riptano/brisk/blob/master/interface/brisk.thrift#L68-L80  
>  
>  
>  
> --  
> -
> Nate McCall
> Austin, TX
> @zznate
>  
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com  



Re: Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-20 Thread Nate McCall
>
>
> My questions –
>
> 1) Is this supported in the Thrift interface or CQL?
>

Not directly, no.


> 2) If not, is there clever data modeling or indexing that could accomplish
> this use case? 1 single-row round-trip to get these columns?
>

If this is a query done frequently you could prefix both columns with a
static value, eg. ["foo:username", foo:city...",
"bar:other_column:..."] so in this specific case you look for 'foo:*'


> 3) Is there plans to support this in the future? Generally, what is the
> future of composite columns in a CQL world?
>
>
You can always extend cassandra.thrift and add a custom method (not as hard
as it sounds - Thrift is designed for this). Side note: DataStax Enterprise
works this way for reading the CassandraFileSystem blocks. An early
prototype:
https://github.com/riptano/brisk/blob/master/interface/brisk.thrift#L68-L80


-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-20 Thread Josh Dzielak
Is there a way to include *multiple* column names in a slice query where only one 
component of the composite column name key needs to match?  

For example, if this was a single row -

username:0   |   username:1   |  city:0   |  city:1 |   other:0|   
other:1
---
bob  |   ted  |  sf   |  nyc|   foo|   bar

I can do a slice with "username:0" and "city:1" or any fully identified column 
names. I also can do a range query w/ first component equal to "username", and 
set the bounds for the second component of the key to +/- infinity (or \u0 
to \u for utf8), and get all columns back that start with "username".

But what if I want to get all usernames and all cities? Without composite keys 
this would be easy - just slice on a collection of column names - ["username", 
"city"]. With composite column names it would have to look something like 
["username:*", "city:*"], where * represents a wildcard or a range.

My questions –

1) Is this supported in the Thrift interface or CQL?
2) If not, is there clever data modeling or indexing that could accomplish this 
use case? 1 single-row round-trip to get these columns?
3) Is there plans to support this in the future? Generally, what is the future 
of composite columns in a CQL world?

Thanks!
Josh



Multi-range of composite query possible?

2013-11-27 Thread Ravikumar Govindarajan
We have the following structure in a composite CF, comprising 2 parts

Key=123  -> A:1, A:2, A:3,B:1, B:2, B:3, B:4, C:1, C:2, C:3,

Our application provides the following inputs for querying on the
first-part of composite column

key=123, [(colName=A, range=2), (colName=B, range=3), (colName=C, range=1)]

The below output is desired

key=123 --> A:1, A:2 [Get first 2 composite cols for prefix 'A']

   B:1, B:2, B:3 [Get first 3 composite cols for prefix 'B']

   C:1 [Get the first composite col for prefix 'C']

I see that this is akin to a "range-of-range" query via composite columns. Is
something like this possible in Cassandra, maybe in the latest versions?

--
Ravi


Re: Cannot restrict PRIMARY KEY part bucket_id by IN relation as a collection is selected by the query

2013-11-11 Thread Aaron Morton
The code just says we cannot support it yet, it may come in the future:

// We only support IN for the last name and for compact 
storage so far
// TODO: #3885 allows us to extend to non compact as well, 
but that remains to be done

> Should this be modelled in a different way in Cassandra? Could you please 
> advice?


Depends on what you are doing with the map column. 

This is roughly the same as using the map, but in a different table:

CREATE TABLE device_map (
  mdid text,
  bucket_id text,
  map_key   text, 
  map_value text,
  PRIMARY KEY(  (mdid, bucket_id), map_key)
)
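
For illustration (not from the original reply), with bucket_id moved into the
partition key as above, the original IN restriction becomes straightforward,
since IN is allowed on the last partition key component:

SELECT map_key, map_value
  FROM device_map
 WHERE mdid = '1'
   AND bucket_id IN ('global_props', 'test_bucket');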

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/11/2013, at 1:05 am, pavli...@gmail.com wrote:

> Hey guys, just started to learn Cassandra recently, got a simple (hopefully) 
> question on querying.
> 
> There's a table with composite primary key - mdid and bucket_id. So I assume 
> mdid is going to be a partition key and bucket_id is a clustering key. 
> There're also two more columns to hold a text and a map. See 
> http://pastie.org/private/fcygmm891hgg4ugyjhtjg for a full picture.
> 
> So, I am basically going to have a big row with may buckets. In my 
> application I am going to retrieve a subset of buckets, not all of them at 
> once, so I do this:
> 
> select  where mdid='1' and bucket_id in ('global_props', 'test_bucket')
> 
> But that gives the error in the subject.
> 
> There's pretty interesting thing is that if I query for text column then the 
> query works, while does not work for the map column. Check the two queries at 
> the bottom http://pastie.org/private/fcygmm891hgg4ugyjhtjg please.
> 
> Should this be modelled in a different way in Cassandra? Could you please 
> advice?
> 
> Thanks, 
> Pavlo



Cannot restrict PRIMARY KEY part bucket_id by IN relation as a collection is selected by the query

2013-11-07 Thread pavli...@gmail.com
Hey guys, just started to learn Cassandra recently, got a simple
(hopefully) question on querying.

There's a table with composite primary key - mdid and bucket_id. So I
assume mdid is going to be a partition key and bucket_id is a clustering
key. There're also two more columns to hold a text and a map. See
http://pastie.org/private/fcygmm891hgg4ugyjhtjg for a full picture.

So, I am basically going to have a big row with many buckets. In my
application I am going to retrieve a subset of buckets, not all of them at
once, so I do this:

select  where mdid='1' and bucket_id in ('global_props', 'test_bucket')


But that gives the error in the subject.

There's pretty interesting thing is that if I query for text column then
the query works, while does not work for the map column. Check the two
queries at the bottom http://pastie.org/private/fcygmm891hgg4ugyjhtjgplease.

Should this be modelled in a different way in Cassandra? Could you please
advise?

Thanks,
Pavlo


Cassandra Data Query

2013-11-03 Thread Chandana Tummala
Hi,

We are using a 6 node cluster with 3 nodes in each DC and a replication
factor of 3.
Cassandra version: dse-3.1.1.
We have to load data into the cluster every two hours using a Java driver
batch program.
Presently the data size in the cluster is 2 TB.
I want to validate the data loaded, so using

select count(*) from table name;

gives me a request timeout error, and using

select count(*) from table name where secondary_index='';

also gives me a request timeout error.

Can you please suggest how to validate the data loaded?
For each load, a max of 1 GB of data is loaded into the cluster.

Is there any way I can validate the count of the data loaded?






--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Data-Query-tp7591180.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Query a datacenter

2013-10-29 Thread srmore
Thanks Rob that helps !


On Fri, Oct 25, 2013 at 7:34 PM, Robert Coli  wrote:

> On Fri, Oct 25, 2013 at 2:47 PM, srmore  wrote:
>
>> I don't know whether this is possible but was just curious, can you query
>> for the data in the remote datacenter with a CL.ONE ?
>>
>
> A coordinator at CL.ONE picks which replica(s) to query based in large
> part on the dynamic snitch. If your remote data center has a lower badness
> score from the perspective of the dynamic snitch, a CL.ONE request might go
> there.
>
> 1.2.11 adds [1] a LOCAL_ONE consistencylevel which does the opposite of
> what you are asking, restricting CL.ONE from going cross-DC.
>
>
>> There could be a case where one might not have a QUORUM and would like to
>> read the most recent  data which includes the data from the other
>> datacenter. AFAIK to reliably read the data from other datacenter we only
>> have CL.EACH_QUORUM.
>>
>
> Using CL.QUORUM requires a QUORUM number of responses, it does not care
> from which data center those responses come.
>
>
>> Also, is there a way one can control how frequently the data is
>> replicated across the datacenters ?
>>
>
> Data centers don't really exist in this context [2], so your question is
> "can one control how frequently data is replicated between replicas" and
> the answer is "no." All replication always goes to every replica.
>
> =Rob
> [1] https://issues.apache.org/jira/browse/CASSANDRA-6202
> [2] this is slightly glib/reductive/inaccurate, but accurate enough for
> the purposes of this response.
>


Re: Query a datacenter

2013-10-25 Thread Robert Coli
On Fri, Oct 25, 2013 at 2:47 PM, srmore  wrote:

> I don't know whether this is possible but was just curious, can you query
> for the data in the remote datacenter with a CL.ONE ?
>

A coordinator at CL.ONE picks which replica(s) to query based in large part
on the dynamic snitch. If your remote data center has a lower badness score
from the perspective of the dynamic snitch, a CL.ONE request might go there.

1.2.11 adds [1] a LOCAL_ONE consistency level which does the opposite of
what you are asking, restricting CL.ONE from going cross-DC.


> There could be a case where one might not have a QUORUM and would like to
> read the most recent  data which includes the data from the other
> datacenter. AFAIK to reliably read the data from other datacenter we only
> have CL.EACH_QUORUM.
>

Using CL.QUORUM requires a QUORUM number of responses, it does not care
from which data center those responses come.


> Also, is there a way one can control how frequently the data is replicated
> across the datacenters ?
>

Data centers don't really exist in this context [2], so your question is
"can one control how frequently data is replicated between replicas" and
the answer is "no." All replication always goes to every replica.

=Rob
[1] https://issues.apache.org/jira/browse/CASSANDRA-6202
[2] this is slightly glib/reductive/inaccurate, but accurate enough for the
purposes of this response.
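
For reference, a minimal sketch of the difference on the client side, assuming the DataStax Java driver 2.x (which exposes LOCAL_ONE) and illustrative keyspace/table names:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalOneExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // CL.ONE: any single replica; the dynamic snitch may pick one in a remote DC
        session.execute(new SimpleStatement("SELECT * FROM my_table WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.ONE));

        // LOCAL_ONE (Cassandra 1.2.11+): the read is restricted to the local DC
        session.execute(new SimpleStatement("SELECT * FROM my_table WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));

        cluster.close();
    }
}
```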


Query a datacenter

2013-10-25 Thread srmore
I don't know whether this is possible but was just curious, can you query
for the data in the remote datacenter with a CL.ONE ?

There could be a case where one might not have a QUORUM and would like to
read the most recent  data which includes the data from the other
datacenter. AFAIK to reliably read the data from other datacenter we only
have CL.EACH_QUORUM.


Also, is there a way one can control how frequently the data is replicated
across the datacenters ?

Thanks !


com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query

2013-10-06 Thread Ran Tavory
Hi all, when using the java-driver I see this error on the client, for
reads as well as for writes.
Many of the ops succeed; however, I do see a significant number of errors.

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra
timeout during write query at consistency ONE (1 replica were required but
only 0 acknowledged the write)
at
com.datastax.driver.core.ResultSetFuture.convertException(ResultSetFuture.java:243)
 at
com.datastax.driver.core.ResultSetFuture$ResponseCallback.onSet(ResultSetFuture.java:119)
at
com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:202)
 at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:331)
at
com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:484)
 at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

The cluster itself isn't working very hard and seems to be in good shape...
CPU load is around 0.1, IO wait is below 1%, all hosts are up, there is no flapping
or anything, and the logs don't indicate any unusual GC activity...


So I'm a bit puzzled as to where to look next. Any hints?...

-- 
/Ran
http://tavory.com
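
When the cluster itself looks healthy, it can help to make the client log exactly which requests time out and get retried, and to compare the driver's behaviour against write_request_timeout_in_ms and read_request_timeout_in_ms in cassandra.yaml. A sketch with the DataStax Java driver (treat the exact classes as version-dependent assumptions):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

public class RetryLoggingExample {
    public static Cluster build() {
        // Wrapping the default retry policy in LoggingRetryPolicy makes every
        // retried, ignored or rethrown timeout visible in the client logs,
        // which helps correlate these exceptions with what the nodes were doing.
        return Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
                .build();
    }
}
```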


Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-30 Thread Aaron Morton
> Thanks for the reply. Isn't the addColumn(IColumn col) method in the writer 
> private though?
> 
> 

Yes, but I thought you had it in your examples; it was included for completeness.
Use the official overloads.

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 4:12 PM, Jayadev Jayaraman  wrote:

> Thanks for the reply. Isn't the addColumn(IColumn col) method in the writer 
> private though? I know what to do now in order to construct a column with a 
> TTL now. Thanks.
> 
> On Sep 26, 2013 9:00 PM, "Aaron Morton"  wrote:
> > org.apache.cassandra.thrift.Column column; // initialize this with name, 
> > value, timestamp, TTL
> This is the wrong object to use.
> 
> one overload of addColumn() accepts IColumn which is from 
> org.apache.cassanda.db . The thrift classes are only use for the thrift API.
> 
> > What is the difference between calling writer.addColumn() on the column's 
> > name, value and timestamp, and writer.addExpiringColumn() on the column's 
> > name, value, TTL, timestamp and expiration timestamp ?
> They both add an column to the row. addExpiringColumn() adds an expiring 
> column, and addColumn adds a normal one.
> 
> only addExpiringColumn accepts a TTL (in seconds) for the column.
> 
> 
> > Does the former result in the column expiring still , in cassandra 1.2.x 
> > (i.e. does setting the TTL on a Column object change the name or value in a 
> > way so as to ensure the column will expire as required) ?
> No.
> An expiring column must be an ExpiringColumn column instance.
> The base IColumn interface does not have a TTL, only expiring columns do.
> 
> >  If not , what is the TTL attribute used for in the Column object ?
> The org.apache.cassandra.db.Column class does not have a TTL.
> 
> Cheers
> 
> 
> -
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 26/09/2013, at 12:44 AM, Jayadev Jayaraman  wrote:
> 
> > Can someone answer this doubt reg. SSTableSimpleWriter ? I'd asked about 
> > this earlier but it probably missed. Apologies for repeating the question 
> > (with minor additions)  :
> >
> > """
> > Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> > with TTL set :
> >
> > org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new 
> > SSTableSimpleWriter( ... /* params here */);
> > org.apache.cassandra.thrift.Column column; // initialize this with name, 
> > value, timestamp, TTL
> >
> > What is the difference between calling writer.addColumn() on the column's 
> > name, value and timestamp, and writer.addExpiringColumn() on the column's 
> > name, value, TTL, timestamp and expiration timestamp ? Does the former 
> > result in the column expiring still , in cassandra 1.2.x (i.e. does setting 
> > the TTL on a Column object change the name or value in a way so as to 
> > ensure the column will expire as required) ? If not , what is the TTL 
> > attribute used for in the Column object ?
> > """
> >
> > Thanks,
> > Jayadev
> >
> >
> > On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman  
> > wrote:
> > Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> > with TTL set :
> >
> > SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here 
> > */);
> > Column column;
> >
> > What is the difference between calling writer.addColumn() on the column's 
> > name and value, and writer.addExpiringColumn() on the column and its TTL ? 
> > Does the former result in the column expiring still , in cassandra 1.2.x ? 
> > Or does it not ?
> >
> >
> >
> 



Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-26 Thread Jayadev Jayaraman
Thanks for the reply. Isn't the addColumn(IColumn col) method in the writer
private though? I know what to do now in order to construct a column with a
TTL. Thanks.
On Sep 26, 2013 9:00 PM, "Aaron Morton"  wrote:

> > org.apache.cassandra.thrift.Column column; // initialize this with name,
> value, timestamp, TTL
> This is the wrong object to use.
>
> one overload of addColumn() accepts IColumn which is from
> org.apache.cassandra.db. The thrift classes are only used for the thrift API.
>
> > What is the difference between calling writer.addColumn() on the
> column's name, value and timestamp, and writer.addExpiringColumn() on the
> column's name, value, TTL, timestamp and expiration timestamp ?
> They both add a column to the row. addExpiringColumn() adds an expiring
> column, and addColumn adds a normal one.
>
> only addExpiringColumn accepts a TTL (in seconds) for the column.
>
>
> > Does the former result in the column expiring still , in cassandra 1.2.x
> (i.e. does setting the TTL on a Column object change the name or value in a
> way so as to ensure the column will expire as required) ?
> No.
> An expiring column must be an ExpiringColumn column instance.
> The base IColumn interface does not have a TTL, only expiring columns do.
>
> >  If not , what is the TTL attribute used for in the Column object ?
> The org.apache.cassandra.db.Column class does not have a TTL.
>
> Cheers
>
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 26/09/2013, at 12:44 AM, Jayadev Jayaraman  wrote:
>
> > Can someone answer this doubt reg. SSTableSimpleWriter ? I'd asked about
> this earlier but it probably missed. Apologies for repeating the question
> (with minor additions)  :
> >
> > """
> > Let's say I've initialized a SSTableSimpleWriter instance and a new
> column with TTL set :
> >
> > org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new
> SSTableSimpleWriter( ... /* params here */);
> > org.apache.cassandra.thrift.Column column; // initialize this with name,
> value, timestamp, TTL
> >
> > What is the difference between calling writer.addColumn() on the
> column's name, value and timestamp, and writer.addExpiringColumn() on the
> column's name, value, TTL, timestamp and expiration timestamp ? Does the
> former result in the column expiring still , in cassandra 1.2.x (i.e. does
> setting the TTL on a Column object change the name or value in a way so as
> to ensure the column will expire as required) ? If not , what is the TTL
> attribute used for in the Column object ?
> > """
> >
> > Thanks,
> > Jayadev
> >
> >
> > On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman 
> wrote:
> > Let's say I've initialized a SSTableSimpleWriter instance and a new
> column with TTL set :
> >
> > SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here
> */);
> > Column column;
> >
> > What is the difference between calling writer.addColumn() on the
> column's name and value, and writer.addExpiringColumn() on the column and
> its TTL ? Does the former result in the column expiring still , in
> cassandra 1.2.x ? Or does it not ?
> >
> >
> >
>
>


Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-26 Thread Aaron Morton
> org.apache.cassandra.thrift.Column column; // initialize this with name, 
> value, timestamp, TTL
This is the wrong object to use. 

one overload of addColumn() accepts IColumn which is from
org.apache.cassandra.db. The thrift classes are only used for the thrift API.

> What is the difference between calling writer.addColumn() on the column's 
> name, value and timestamp, and writer.addExpiringColumn() on the column's 
> name, value, TTL, timestamp and expiration timestamp ?
They both add a column to the row. addExpiringColumn() adds an expiring
column, and addColumn adds a normal one.

only addExpiringColumn accepts a TTL (in seconds) for the column.


> Does the former result in the column expiring still , in cassandra 1.2.x 
> (i.e. does setting the TTL on a Column object change the name or value in a 
> way so as to ensure the column will expire as required) ? 
No. 
An expiring column must be an ExpiringColumn instance.
The base IColumn interface does not have a TTL, only expiring columns do. 

>  If not , what is the TTL attribute used for in the Column object ?
The org.apache.cassandra.db.Column class does not have a TTL. 

Cheers
  

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 26/09/2013, at 12:44 AM, Jayadev Jayaraman  wrote:

> Can someone answer this doubt reg. SSTableSimpleWriter ? I'd asked about this 
> earlier but it probably missed. Apologies for repeating the question (with 
> minor additions)  : 
> 
> """
> Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> with TTL set : 
> 
> org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new 
> SSTableSimpleWriter( ... /* params here */);
> org.apache.cassandra.thrift.Column column; // initialize this with name, 
> value, timestamp, TTL
> 
> What is the difference between calling writer.addColumn() on the column's 
> name, value and timestamp, and writer.addExpiringColumn() on the column's 
> name, value, TTL, timestamp and expiration timestamp ? Does the former result 
> in the column expiring still , in cassandra 1.2.x (i.e. does setting the TTL 
> on a Column object change the name or value in a way so as to ensure the 
> column will expire as required) ? If not , what is the TTL attribute used for 
> in the Column object ?
> """
> 
> Thanks,
> Jayadev
> 
> 
> On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman  
> wrote:
> Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> with TTL set : 
> 
> SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here */);
> Column column;
> 
> What is the difference between calling writer.addColumn() on the column's 
> name and value, and writer.addExpiringColumn() on the column and its TTL ? 
> Does the former result in the column expiring still , in cassandra 1.2.x ? Or 
> does it not ?
> 
> 
> 



Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-25 Thread Jayadev Jayaraman
Can someone answer this doubt reg. SSTableSimpleWriter ? I'd asked about
this earlier but it probably missed. Apologies for repeating the question
(with minor additions)  :

"""
Let's say I've initialized a *SSTableSimpleWriter* instance and a new
column with TTL set :

*org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new
SSTableSimpleWriter( ... /* params here */);*
*org.apache.cassandra.thrift.Column column; // initialize this with name,
value, timestamp, TTL*

What is the difference between calling *writer.addColumn()* on the column's
name, value and timestamp, and *writer.addExpiringColumn()* on the column's
name, value, TTL, timestamp and expiration timestamp ? Does the former
result in the column expiring still , in cassandra 1.2.x (i.e. does setting
the TTL on a Column object change the name or value in a way so as to
ensure the column will expire as required) ? If not , what is the TTL
attribute used for in the Column object ?
"""

Thanks,
Jayadev


On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman wrote:

> Let's say I've initialized a *SSTableSimpleWriter* instance and a new
> column with TTL set :
>
> *SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here
> */);*
> *Column column;*
>
> What is the difference between calling *writer.addColumn()* on the
> column's name and value, and *writer.addExpiringColumn()* on the column
> and its TTL ? Does the former result in the column expiring still , in
> cassandra 1.2.x ? Or does it not ?
>
>
>


Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-24 Thread Jayadev Jayaraman
Let's say I've initialized a *SSTableSimpleWriter* instance and a new
column with TTL set :

*SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here
*/);*
*Column column;*

What is the difference between calling *writer.addColumn()* on the column's
name and value, and *writer.addExpiringColumn()* on the column and its TTL
? Does the former result in the column expiring still , in cassandra 1.2.x
? Or does it not ?
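
Going by the replies above, a minimal sketch of the two calls against the 1.2.x classes (the exact signatures may differ slightly between versions; only addExpiringColumn() actually makes the column expire):

```java
import java.nio.ByteBuffer;
import org.apache.cassandra.io.sstable.SSTableSimpleWriter;
import org.apache.cassandra.utils.ByteBufferUtil;

public class WriterSketch {
    public static void write(SSTableSimpleWriter writer) throws Exception {
        ByteBuffer key   = ByteBufferUtil.bytes("row1");
        ByteBuffer value = ByteBufferUtil.bytes("val1");

        long timestampMicros = System.currentTimeMillis() * 1000;
        int ttlSeconds = 3600;
        long expirationMillis = System.currentTimeMillis() + ttlSeconds * 1000L;

        writer.newRow(key);
        // normal column: never expires, no TTL is recorded anywhere
        writer.addColumn(ByteBufferUtil.bytes("col1"), value, timestampMicros);
        // expiring column: carries the TTL (seconds) plus an absolute expiration time (ms)
        writer.addExpiringColumn(ByteBufferUtil.bytes("col2"), value,
                timestampMicros, ttlSeconds, expirationMillis);
        writer.close();
    }
}
```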


select count query not working at cassandra 2.0.0

2013-09-19 Thread Katsutoshi
I would like to use a select count query.
It worked at Cassandra 1.2.9, but there is a situation in which it
does not work at Cassandra 2.0.0:
if some row is deleted, the select count query seems to return the wrong
value.
Did anything change in Cassandra 2.0.0, or have I made a mistake?

My test procedure is as follows:

### At Cassandra 1.2.9

1) create table, and insert two rows

```
cqlsh:test> CREATE TABLE count_hash_test (key text, value text, PRIMARY KEY
(key));
cqlsh:test> INSERT INTO count_hash_test (key, value) VALUES ('key1',
'value');
cqlsh:test> INSERT INTO count_hash_test (key, value) VALUES ('key2',
'value');
```

2) do a select count query, it returns 2 which is expected

```
cqlsh:test> SELECT * FROM count_hash_test;

 key  | value
--+---
 key1 | value
 key2 | value

cqlsh:test> SELECT COUNT(*) FROM count_hash_test;

 count
---
 2
```

3) delete one row

```
cqlsh:test> DELETE FROM count_hash_test WHERE key='key1';
```

4) do a select count query, it returns 1 which is expected

```
cqlsh:test> SELECT * FROM count_hash_test;

 key  | value
--+---
 key2 | value

cqlsh:test> SELECT COUNT(*) FROM count_hash_test;

 count
---
 1
```

### At Cassandra 2.0.0

1) create table, and insert two rows

```
cqlsh:test> CREATE TABLE count_hash_test (key text, value text, PRIMARY KEY
(key));
cqlsh:test> INSERT INTO count_hash_test (key, value) VALUES ('key1',
'value');
cqlsh:test> INSERT INTO count_hash_test (key, value) VALUES ('key2',
'value');
```

2) do a select count query, it returns 2  which is expected

```
cqlsh:test> SELECT * FROM count_hash_test;

 key  | value
--+---
 key1 | value
 key2 | value

cqlsh:test> SELECT COUNT(*) FROM count_hash_test;

 count
---
 2
```

3) delete one row

```
cqlsh:test> DELETE FROM count_hash_test WHERE key='key1';
```

4) do a select count query, but it returns 0 which is NOT expected

```
cqlsh:test> SELECT * FROM count_hash_test;

 key  | value
--+---
 key2 | value

cqlsh:test> SELECT COUNT(*) FROM count_hash_test;

 count
---
 0
```

Could anyone help me for this? thanks.

Katsutoshi


Re: Read query slows down when a node goes down

2013-09-16 Thread sankalp kohli
Repair should not take that long since you have very little data. Check the
logs of the other machines with which it is repairing to find anything
interesting.
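
For what it's worth, a couple of ways to scope a repair down so each run is shorter and its progress is visible (a sketch; the keyspace and column family names are placeholders):

```
# repair only the ranges this node is primary for (run on each node in turn)
nodetool repair -pr my_keyspace

# or repair a single column family at a time
nodetool repair -pr my_keyspace my_columnfamily

# watch what it is doing while it runs
nodetool compactionstats
nodetool netstats
```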


On Mon, Sep 16, 2013 at 10:15 AM, Parag Patel wrote:

>  Thanks.  I’ve noticed that a repair takes a long time to finish.  My
> data is quite small, 1.5GB on each node when running nodetool status.  Is
> there any way to speed up repairs? (FYI, I haven’t actually seen a repair
> finish since it didn’t return after 10 mins – I figured I was doing
> something wrong).
>
>
> *From:* sankalp kohli [mailto:kohlisank...@gmail.com]
> *Sent:* Monday, September 16, 2013 1:10 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Read query slows down when a node goes down
>
>
> For how long does the read latencies go up once a machine is down? It
> takes a configurable amount of time for machines to detect that another
> machine is down. This is done through Gossip. The algo to detect failures
> is The Phi accrual failure detector.
>
>
> Regarding your question, if you are bootstrapping then it need to get the
> data from other nodes and during this time, it will not serve any reads but
> will accept writes. Once it has all the data, it will start serving reads.
> In the logs it will have something like "now serving reads"
>
> .
>
> If you are bringing back a machine which is offline, then it will start
> accepting reads and writes immediately but then you should run a repair to
> get the missing data. 
>
>
> On Mon, Sep 16, 2013 at 8:12 AM, Parag Patel 
> wrote:
>
> RF=3.  Single dc deployment.  No v-nodes.
>
>  
>
> Is there a certain amount of time I need to wait from the time the down
> node is started to the point where it’s ready to be used?  If so, what’s
> that time?  If it’s dynamic, how would I know when it’s ready?
>
>  
>
> Thanks,
>
> Parag
>
>  
>
> *From:* sankalp kohli [mailto:kohlisank...@gmail.com]
> *Sent:* Sunday, September 15, 2013 4:52 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Read query slows down when a node goes down
>
>  
>
> What is your replication factor? DO you have multi-DC deployment? Also are
> u using v nodes?
>
>  
>
> On Sun, Sep 15, 2013 at 7:54 AM, Parag Patel 
> wrote:
>
> Hi,
>
>  
>
> We have a six node cluster running DataStax Community Edition 1.2.9.  From
> our app, we use the Netflix Astyanax library to read and write records into
> our cluster.  We read and write with QUARUM.  We’re experiencing an issue
> where when a node goes down, we see our read queries slowing down in our
> app whenever a node goes offline.  This is a problem that is very
> reproducible.  Has anybody experienced this before or do people have
> suggestions on what I could try?
>
>  
>
> Thanks,
>
> Parag
>
>  
>
>


Re: Read query slows down when a node goes down

2013-09-16 Thread sankalp kohli
For how long do the read latencies go up once a machine is down? It takes
a configurable amount of time for machines to detect that another machine
is down. This is done through Gossip. The algorithm used to detect failures is
the Phi accrual failure detector.

Regarding your question, if you are bootstrapping then the node needs to get the
data from the other nodes, and during this time it will not serve any reads but
will accept writes. Once it has all the data, it will start serving reads.
In the logs it will have something like "now serving reads".

If you are bringing back a machine which is offline, then it will start
accepting reads and writes immediately, but then you should run a repair to
get the missing data.
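
The detection window itself is tunable: phi_convict_threshold in cassandra.yaml controls how aggressively the Phi accrual detector marks a peer down (8 is the shipped default, as far as I know), and nodetool status / nodetool gossipinfo show what each node currently believes. A sketch:

```
# cassandra.yaml
# Lower values mark a dead node down sooner but risk flapping on GC pauses;
# higher values are more tolerant but leave a longer window in which a dead
# node is still considered up and keeps receiving requests.
phi_convict_threshold: 8
```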






On Mon, Sep 16, 2013 at 8:12 AM, Parag Patel wrote:

>  RF=3.  Single dc deployment.  No v-nodes.
>
>
> Is there a certain amount of time I need to wait from the time the down
> node is started to the point where it’s ready to be used?  If so, what’s
> that time?  If it’s dynamic, how would I know when it’s ready?
>
>
> Thanks,
>
> Parag
>
>
> *From:* sankalp kohli [mailto:kohlisank...@gmail.com]
> *Sent:* Sunday, September 15, 2013 4:52 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Read query slows down when a node goes down
>
>
> What is your replication factor? DO you have multi-DC deployment? Also are
> u using v nodes?
>
>
> On Sun, Sep 15, 2013 at 7:54 AM, Parag Patel 
> wrote:
>
> Hi,
>
>  
>
> We have a six node cluster running DataStax Community Edition 1.2.9.  From
> our app, we use the Netflix Astyanax library to read and write records into
> our cluster.  We read and write with QUARUM.  We’re experiencing an issue
> where when a node goes down, we see our read queries slowing down in our
> app whenever a node goes offline.  This is a problem that is very
> reproducible.  Has anybody experienced this before or do people have
> suggestions on what I could try?
>
>  
>
> Thanks,
>
> Parag
>
>


RE: Read query slows down when a node goes down

2013-09-16 Thread Parag Patel
Thanks.  I've noticed that a repair takes a long time to finish.  My data is 
quite small, 1.5GB on each node when running nodetool status.  Is there any way 
to speed up repairs? (FYI, I haven't actually seen a repair finish since it 
didn't return after 10 mins - I figured I was doing something wrong).

From: sankalp kohli [mailto:kohlisank...@gmail.com]
Sent: Monday, September 16, 2013 1:10 PM
To: user@cassandra.apache.org
Subject: Re: Read query slows down when a node goes down

For how long does the read latencies go up once a machine is down? It takes a 
configurable amount of time for machines to detect that another machine is 
down. This is done through Gossip. The algo to detect failures is The Phi 
accrual failure detector.

Regarding your question, if you are bootstrapping then it need to get the data 
from other nodes and during this time, it will not serve any reads but will 
accept writes. Once it has all the data, it will start serving reads. In the 
logs it will have something like "now serving reads"
.
If you are bringing back a machine which is offline, then it will start 
accepting reads and writes immediately but then you should run a repair to get 
the missing data.





On Mon, Sep 16, 2013 at 8:12 AM, Parag Patel 
mailto:parag.pa...@fusionts.com>> wrote:
RF=3.  Single dc deployment.  No v-nodes.

Is there a certain amount of time I need to wait from the time the down node is 
started to the point where it's ready to be used?  If so, what's that time?  If 
it's dynamic, how would I know when it's ready?

Thanks,
Parag

From: sankalp kohli 
[mailto:kohlisank...@gmail.com<mailto:kohlisank...@gmail.com>]
Sent: Sunday, September 15, 2013 4:52 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Read query slows down when a node goes down

What is your replication factor? DO you have multi-DC deployment? Also are u 
using v nodes?

On Sun, Sep 15, 2013 at 7:54 AM, Parag Patel 
mailto:parag.pa...@fusionts.com>> wrote:
Hi,

We have a six node cluster running DataStax Community Edition 1.2.9.  From our 
app, we use the Netflix Astyanax library to read and write records into our 
cluster.  We read and write with QUARUM.  We're experiencing an issue where 
when a node goes down, we see our read queries slowing down in our app whenever 
a node goes offline.  This is a problem that is very reproducible.  Has anybody 
experienced this before or do people have suggestions on what I could try?

Thanks,
Parag




RE: Read query slows down when a node goes down

2013-09-16 Thread Parag Patel
RF=3.  Single dc deployment.  No v-nodes.

Is there a certain amount of time I need to wait from the time the down node is 
started to the point where it's ready to be used?  If so, what's that time?  If 
it's dynamic, how would I know when it's ready?

Thanks,
Parag

From: sankalp kohli [mailto:kohlisank...@gmail.com]
Sent: Sunday, September 15, 2013 4:52 PM
To: user@cassandra.apache.org
Subject: Re: Read query slows down when a node goes down

What is your replication factor? DO you have multi-DC deployment? Also are u 
using v nodes?

On Sun, Sep 15, 2013 at 7:54 AM, Parag Patel 
mailto:parag.pa...@fusionts.com>> wrote:
Hi,

We have a six node cluster running DataStax Community Edition 1.2.9.  From our 
app, we use the Netflix Astyanax library to read and write records into our 
cluster.  We read and write with QUARUM.  We're experiencing an issue where 
when a node goes down, we see our read queries slowing down in our app whenever 
a node goes offline.  This is a problem that is very reproducible.  Has anybody 
experienced this before or do people have suggestions on what I could try?

Thanks,
Parag



Re: Read query slows down when a node goes down

2013-09-15 Thread sankalp kohli
What is your replication factor? Do you have a multi-DC deployment? Also, are
you using vnodes?


On Sun, Sep 15, 2013 at 7:54 AM, Parag Patel wrote:

>  Hi,
>
>
> We have a six node cluster running DataStax Community Edition 1.2.9.  From
> our app, we use the Netflix Astyanax library to read and write records into
> our cluster.  We read and write with QUARUM.  We’re experiencing an issue
> where when a node goes down, we see our read queries slowing down in our
> app whenever a node goes offline.  This is a problem that is very
> reproducible.  Has anybody experienced this before or do people have
> suggestions on what I could try?
>
>
> Thanks,
>
> Parag
>


Read query slows down when a node goes down

2013-09-15 Thread Parag Patel
Hi,

We have a six node cluster running DataStax Community Edition 1.2.9.  From our 
app, we use the Netflix Astyanax library to read and write records into our 
cluster.  We read and write with QUORUM.  We're experiencing an issue where our 
read queries slow down in our app whenever a node goes offline.  This is a 
problem that is very reproducible.  Has anybody experienced this before, or do 
people have suggestions on what I could try?

Thanks,
Parag


Re: CQL3 query

2013-07-30 Thread baskar.duraikannu.db
A SlicePredicate only supports a limit of “N” columns per row. So you need to query one facet at a 
time, OR you can query m columns such that the slice returns n revisions per facet. You may need 
some intelligence to increase or decrease m heuristically.



From: ravi prasad
Sent: Tuesday, July 30, 2013 8:11 PM
To: cassandramailinglist



Hi,

 

I have a data modelling question.  I'm modelling for an use case where, an 
object can have multiple facets and each facet can have multiple revisions and 
the query pattern looks like "get latest 'n' revisions for all facets for an 
object (n=1,2,3)".   With a table like below:





create table object (
id uuid,

facet text,

revision timeuuid


value text

primary key (id,facet,revision) #clustered on facet and revision





Does CQL3 support querying efficiently  'get latest revision for all facets 
where id='something';  in a single query.   What would the query look like?





Thanks,

-Ravi

CQL3 query

2013-07-30 Thread ravi prasad
Hi,
 
I have a data modelling question.  I'm modelling for a use case where an 
object can have multiple facets, each facet can have multiple revisions, and 
the query pattern looks like "get the latest 'n' revisions for all facets of an 
object (n=1,2,3)".   With a table like the one below:


create table object (
    id uuid,
    facet text,
    revision timeuuid,
    value text,
    primary key (id, facet, revision)   -- clustered on facet and revision
);


Does CQL3 support efficiently querying "get the latest revision for all facets 
where id='something'" in a single query?   What would the query look like?


Thanks,
-Ravi
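
As the reply above suggests, the practical approach is one small query per facet; with the table clustered newest-first, the query is just a LIMIT. A sketch (the id value is illustrative):

```
CREATE TABLE object (
    id uuid,
    facet text,
    revision timeuuid,
    value text,
    PRIMARY KEY (id, facet, revision)
) WITH CLUSTERING ORDER BY (facet ASC, revision DESC);

-- one query per facet; the DESC clustering makes LIMIT n return the newest n
SELECT facet, revision, value
  FROM object
 WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
   AND facet = 'some_facet'
 LIMIT 3;
```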


Re: funnel analytics, how to query for reports etc.

2013-07-24 Thread aaron morton
> Too bad Rainbird isn't open sourced yet!
It's been 2 years, I would not hold your breath. 

I remembered there are two open source time series projects out there:
https://github.com/deanhiller/databus
https://github.com/Pardot/Rhombus

Cheers


-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 24/07/2013, at 4:00 AM, S Ahmed  wrote:

> Thanks Aaron.
> 
> Too bad Rainbird isn't open sourced yet!
> 
> 
> On Tue, Jul 23, 2013 at 4:48 AM, aaron morton  wrote:
> For background on rollup analytics:
> 
> Twitter Rainbird  
> http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
> Acunu http://www.acunu.com/
> 
> Cheers
> 
> -
> Aaron Morton
> Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 22/07/2013, at 1:03 AM, Vladimir Prudnikov  wrote:
> 
> > This can be done easily,
> >
> > Use normal column family to store the sequence of events where key is 
> > session #ID identifying one use interaction with a website, column names 
> > are TimeUUID values and column value id of the event (do not write 
> > something like "user added product to shopping cart", something shorter 
> > identifying this event).
> >
> > Then you can use counter column family to store counters, you can count 
> > anything, number of sessions, total number of events, number of particular 
> > events etc. One row per day for example. Then you can retrieve this row and 
> > calculate all required %.
> >
> >
> > On Sun, Jul 21, 2013 at 1:05 AM, S Ahmed  wrote:
> > Would cassandra be a good choice for creating a funnel analytics type 
> > product similar to mixpanel?
> >
> > e.g.  You create a set of events and store them in cassandra for things 
> > like:
> >
> > event#1 user visited product page
> > event#2 user added product to shopping cart
> > event#3 user clicked on checkout page
> > event#4 user filled out cc information
> > event#5 user purchased product
> >
> > Now in my web application I track each user and store the events somehow in 
> > cassandra (in some column family etc)
> >
> > Now how will I pull a report that produces results like:
> >
> > 70% of people added to shopping cart
> > 20% checkout page
> > 10% filled out cc information
> > 4% purchased the product
> >
> >
> > And this is for a Saas, so this report would be for thousands of customers 
> > in theory.
> >
> >
> >
> > --
> > Vladimir Prudnikov
> 
> 



Re: funnel analytics, how to query for reports etc.

2013-07-23 Thread S Ahmed
Thanks Aaron.

Too bad Rainbird isn't open sourced yet!


On Tue, Jul 23, 2013 at 4:48 AM, aaron morton wrote:

> For background on rollup analytics:
>
> Twitter Rainbird
> http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
> Acunu http://www.acunu.com/
>
> Cheers
>
> -
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/07/2013, at 1:03 AM, Vladimir Prudnikov 
> wrote:
>
> > This can be done easily,
> >
> > Use normal column family to store the sequence of events where key is
> session #ID identifying one use interaction with a website, column names
> are TimeUUID values and column value id of the event (do not write
> something like "user added product to shopping cart", something shorter
> identifying this event).
> >
> > Then you can use counter column family to store counters, you can count
> anything, number of sessions, total number of events, number of particular
> events etc. One row per day for example. Then you can retrieve this row and
> calculate all required %.
> >
> >
> > On Sun, Jul 21, 2013 at 1:05 AM, S Ahmed  wrote:
> > Would cassandra be a good choice for creating a funnel analytics type
> product similar to mixpanel?
> >
> > e.g.  You create a set of events and store them in cassandra for things
> like:
> >
> > event#1 user visited product page
> > event#2 user added product to shopping cart
> > event#3 user clicked on checkout page
> > event#4 user filled out cc information
> > event#5 user purchased product
> >
> > Now in my web application I track each user and store the events somehow
> in cassandra (in some column family etc)
> >
> > Now how will I pull a report that produces results like:
> >
> > 70% of people added to shopping cart
> > 20% checkout page
> > 10% filled out cc information
> > 4% purchased the product
> >
> >
> > And this is for a Saas, so this report would be for thousands of
> customers in theory.
> >
> >
> >
> > --
> > Vladimir Prudnikov
>
>


Re: funnel analytics, how to query for reports etc.

2013-07-23 Thread aaron morton
For background on rollup analytics:

Twitter Rainbird  
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
Acunu http://www.acunu.com/

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/07/2013, at 1:03 AM, Vladimir Prudnikov  wrote:

> This can be done easily,
> 
> Use normal column family to store the sequence of events where key is session 
> #ID identifying one use interaction with a website, column names are TimeUUID 
> values and column value id of the event (do not write something like "user 
> added product to shopping cart", something shorter identifying this event).
> 
> Then you can use counter column family to store counters, you can count 
> anything, number of sessions, total number of events, number of particular 
> events etc. One row per day for example. Then you can retrieve this row and 
> calculate all required %.
> 
> 
> On Sun, Jul 21, 2013 at 1:05 AM, S Ahmed  wrote:
> Would cassandra be a good choice for creating a funnel analytics type product 
> similar to mixpanel?
> 
> e.g.  You create a set of events and store them in cassandra for things like:
> 
> event#1 user visited product page
> event#2 user added product to shopping cart
> event#3 user clicked on checkout page
> event#4 user filled out cc information
> event#5 user purchased product
> 
> Now in my web application I track each user and store the events somehow in 
> cassandra (in some column family etc)
> 
> Now how will I pull a report that produces results like:
> 
> 70% of people added to shopping cart
> 20% checkout page
> 10% filled out cc information
> 4% purchased the product
> 
> 
> And this is for a Saas, so this report would be for thousands of customers in 
> theory.
> 
> 
> 
> -- 
> Vladimir Prudnikov



Re: Huge query Cassandra limits

2013-07-21 Thread aaron morton
> .The combination was performing better was querying for 500 rows at a time 
> with 1000 columns while different combinations, such as 125 rows for 4000 
> columns or 1000 rows for 500 columns, were about the 15% slower. 
I would rarely go above 100 rows, especially if you are asking for 1000 columns.

> If you consider it depends also on the number of nodes in the cluster, the 
> memory available  and the number of rows and column the query needs, the 
> problem of how  optimally divide a request  becomes quite  complex. 

It sounds like you are targeting single read thread performance. 
If you want to go faster make your client do smaller requests in parallel. 

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/07/2013, at 12:26 AM, cesare cugnasco  wrote:

> Thank you Aaron,  your advice about a newer client it is really interesting. 
> We will take in account it!
> 
> Here, some numbers about our tests: we found that more or less that with more 
> than 500k elements (multiplying rows and columns requested) there was the 
> inflection point, and so asking for more the performance can only 
> decrease.The combination was performing better was querying for 500 rows at a 
> time with 1000 columns while different combinations, such as 125 rows for 
> 4000 columns or 1000 rows for 500 columns, were about the 15% slower. Other 
> combinations have even bigger differences.
> 
> It was a cluster of 16 nodes, with 24GBs or ram, sata-2 SSDs and 8-cores 
> CPUs@2.6 GHz.
> 
> The issue is this memory limit can be reached with many combinations of row 
> and columns. Broadly speaking, in using more rows or columns there is a 
> trade-off between better having a better parallelization and and higher 
> overhead. 
> If you consider it depends also on the number of nodes in the cluster, the 
> memory available  and the number of rows and column the query needs, the 
> problem of how  optimally divide a request  becomes quite  complex. 
>  
> Does these numbers make sense for you?
> 
> Cheers
> 
> 
> 2013/7/17 aaron morton 
> >  In ours tests,  we found there's a significant performance difference 
> > between various  configurations and we are studying a policy to optimize 
> > it. The doubt is that, if the needing of issuing multiple requests is 
> > caused only by a fixable implementation detail, would make pointless do 
> > this study.
> if you provide your numbers we can see if you are getting expected results.
> 
> There are some limiting factors. Using the thrift API the max message size is 
> 15 MB. And each row you ask for becomes (roughly) RF number of tasks in the 
> thread pools on replicas. When you ask for 1000 rows it creates (roughly) 
> 3,000 tasks in the replicas. If you have other clients trying to do reads at 
> the same time this can cause delays to their reads.
> 
> Like everything in computing, more is not always better. Run some tests to 
> try multi gets with different sizes and see where improvements in the overall 
> throughput begin to decline.
> 
> Also consider using a newer client with token aware balancing and async 
> networking. Again though, if you try to read everything at once you are going 
> to have a bad day.
> 
> Cheers
> 
> -
> Aaron Morton
> Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 17/07/2013, at 8:24 PM, cesare cugnasco  wrote:
> 
> > Hi Rob,
> > of course, we could issue multiple requests, but then we should  consider 
> > which is the optimal way to split the query in smaller ones. Moreover, we 
> > should choose how many of sub-query run in parallel.
> >  In ours tests,  we found there's a significant performance difference 
> > between various  configurations and we are studying a policy to optimize 
> > it. The doubt is that, if the needing of issuing multiple requests is 
> > caused only by a fixable implementation detail, would make pointless do 
> > this study.
> >
> > Does anyone made similar analysis?
> >
> >
> > 2013/7/16 Robert Coli 
> >
> > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco 
> >  wrote:
> > We  are working on porting some life science applications to Cassandra, but 
> > we have to deal with its limits managing huge queries. Our queries are 
> > usually multiget_slice ones: many rows with many columns each.
> >
> > You are not getting much "win" by increasing request size in Cassandra, and 
> > you expose yourself to "lose" such as you have experienced.
> >
> > Is there some reason you cannot just issue multiple requests?
> >
> > =Rob
> >
> 
> 



Re: funnel analytics, how to query for reports etc.

2013-07-21 Thread Vladimir Prudnikov
This can be done easily,

Use a normal column family to store the sequence of events, where the key is
a session ID identifying one user interaction with the website, column names
are TimeUUID values, and the column value is the id of the event (do not write
something like "user added product to shopping cart"; use something shorter
that identifies the event).

Then you can use a counter column family to store counters; you can count
anything: number of sessions, total number of events, number of particular
events, etc. One row per day, for example. Then you can retrieve this row and
calculate all the required percentages.
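
A CQL3 sketch of that layout (all names and sample values are illustrative):

```
-- raw event stream: one row per session, clustered by time
CREATE TABLE session_events (
    session_id uuid,
    event_time timeuuid,
    event_id   int,          -- e.g. 2 = "added product to shopping cart"
    PRIMARY KEY (session_id, event_time)
);

-- daily rollup per customer: one counter per event type
CREATE TABLE funnel_counters (
    customer_id uuid,
    day         text,
    event_id    int,
    hits        counter,
    PRIMARY KEY ((customer_id, day), event_id)
);

-- on every tracked event:
INSERT INTO session_events (session_id, event_time, event_id)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, now(), 2);

UPDATE funnel_counters SET hits = hits + 1
 WHERE customer_id = 873f3ab0-f1c3-11e2-b778-0800200c9a66
   AND day = '2013-07-21' AND event_id = 2;

-- report: read one partition back and compute the funnel percentages client-side
SELECT event_id, hits
  FROM funnel_counters
 WHERE customer_id = 873f3ab0-f1c3-11e2-b778-0800200c9a66 AND day = '2013-07-21';
```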


On Sun, Jul 21, 2013 at 1:05 AM, S Ahmed  wrote:

> Would cassandra be a good choice for creating a funnel analytics type
> product similar to mixpanel?
>
> e.g.  You create a set of events and store them in cassandra for things
> like:
>
> event#1 user visited product page
> event#2 user added product to shopping cart
> event#3 user clicked on checkout page
> event#4 user filled out cc information
> event#5 user purchased product
>
> Now in my web application I track each user and store the events somehow
> in cassandra (in some column family etc)
>
> Now how will I pull a report that produces results like:
>
> 70% of people added to shopping cart
> 20% checkout page
> 10% filled out cc information
> 4% purchased the product
>
>
> And this is for a Saas, so this report would be for thousands of customers
> in theory.
>



-- 
Vladimir Prudnikov


funnel analytics, how to query for reports etc.

2013-07-20 Thread S Ahmed
Would cassandra be a good choice for creating a funnel analytics type
product similar to mixpanel?

e.g.  You create a set of events and store them in cassandra for things
like:

event#1 user visited product page
event#2 user added product to shopping cart
event#3 user clicked on checkout page
event#4 user filled out cc information
event#5 user purchased product

Now in my web application I track each user and store the events somehow in
cassandra (in some column family etc)

Now how will I pull a report that produces results like:

70% of people added to shopping cart
20% checkout page
10% filled out cc information
4% purchased the product


And this is for a Saas, so this report would be for thousands of customers
in theory.


Re: Huge query Cassandra limits

2013-07-18 Thread cesare cugnasco
Thank you Aaron, your advice about a newer client is really
interesting. We will take it into account!

Here are some numbers from our tests: we found that, more or less, beyond
500k elements (multiplying rows by columns requested) we hit the inflection
point, and asking for more only decreases performance. The best-performing
combination was querying for 500 rows at a time with 1000 columns, while
different combinations, such as 125 rows with 4000 columns or 1000 rows with
500 columns, were about 15% slower. Other combinations show even bigger
differences.

It was a cluster of 16 nodes, with 24GB of RAM, SATA-2 SSDs and 8-core
CPUs @ 2.6 GHz.

The issue is that this memory limit can be reached with many combinations of
rows and columns. Broadly speaking, in using more rows or columns there is a
trade-off between better parallelization and higher overhead.
If you consider that it also depends on the number of nodes in the cluster,
the memory available, and the number of rows and columns the query needs, the
problem of how to optimally divide a request becomes quite complex.

Do these numbers make sense to you?

Cheers


2013/7/17 aaron morton 

> >  In ours tests,  we found there's a significant performance difference
> between various  configurations and we are studying a policy to optimize
> it. The doubt is that, if the needing of issuing multiple requests is
> caused only by a fixable implementation detail, would make pointless do
> this study.
> if you provide your numbers we can see if you are getting expected results.
>
> There are some limiting factors. Using the thrift API the max message size
> is 15 MB. And each row you ask for becomes (roughly) RF number of tasks in
> the thread pools on replicas. When you ask for 1000 rows it creates
> (roughly) 3,000 tasks in the replicas. If you have other clients trying to
> do reads at the same time this can cause delays to their reads.
>
> Like everything in computing, more is not always better. Run some tests to
> try multi gets with different sizes and see where improvements in the
> overall throughput begin to decline.
>
> Also consider using a newer client with token aware balancing and async
> networking. Again though, if you try to read everything at once you are
> going to have a bad day.
>
> Cheers
>
> -
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/07/2013, at 8:24 PM, cesare cugnasco 
> wrote:
>
> > Hi Rob,
> > of course, we could issue multiple requests, but then we should
>  consider which is the optimal way to split the query in smaller ones.
> Moreover, we should choose how many of sub-query run in parallel.
> >  In ours tests,  we found there's a significant performance difference
> between various  configurations and we are studying a policy to optimize
> it. The doubt is that, if the needing of issuing multiple requests is
> caused only by a fixable implementation detail, would make pointless do
> this study.
> >
> > Does anyone made similar analysis?
> >
> >
> > 2013/7/16 Robert Coli 
> >
> > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco <
> cesare.cugna...@gmail.com> wrote:
> > We  are working on porting some life science applications to Cassandra,
> but we have to deal with its limits managing huge queries. Our queries are
> usually multiget_slice ones: many rows with many columns each.
> >
> > You are not getting much "win" by increasing request size in Cassandra,
> and you expose yourself to "lose" such as you have experienced.
> >
> > Is there some reason you cannot just issue multiple requests?
> >
> > =Rob
> >
>
>


Re: Huge query Cassandra limits

2013-07-17 Thread aaron morton
>  In ours tests,  we found there's a significant performance difference 
> between various  configurations and we are studying a policy to optimize it. 
> The doubt is that, if the needing of issuing multiple requests is caused only 
> by a fixable implementation detail, would make pointless do this study.
if you provide your numbers we can see if you are getting expected results. 

There are some limiting factors. Using the thrift API the max message size is 
15 MB. And each row you ask for becomes (roughly) RF number of tasks in the 
thread pools on replicas. When you ask for 1000 rows it creates (roughly) 3,000 
tasks in the replicas. If you have other clients trying to do reads at the same 
time this can cause delays to their reads. 

Like everything in computing, more is not always better. Run some tests to try 
multi gets with different sizes and see where improvements in the overall 
throughput begin to decline. 

Also consider using a newer client with token aware balancing and async 
networking. Again though, if you try to read everything at once you are going 
to have a bad day.

Cheers
  
-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/07/2013, at 8:24 PM, cesare cugnasco  wrote:

> Hi Rob,
> of course, we could issue multiple requests, but then we should  consider 
> which is the optimal way to split the query in smaller ones. Moreover, we 
> should choose how many of sub-query run in parallel.
>  In ours tests,  we found there's a significant performance difference 
> between various  configurations and we are studying a policy to optimize it. 
> The doubt is that, if the needing of issuing multiple requests is caused only 
> by a fixable implementation detail, would make pointless do this study.
> 
> Does anyone made similar analysis?
> 
> 
> 2013/7/16 Robert Coli 
> 
> On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco  
> wrote:
> We  are working on porting some life science applications to Cassandra, but 
> we have to deal with its limits managing huge queries. Our queries are 
> usually multiget_slice ones: many rows with many columns each.
> 
> You are not getting much "win" by increasing request size in Cassandra, and 
> you expose yourself to "lose" such as you have experienced.
> 
> Is there some reason you cannot just issue multiple requests?
> 
> =Rob 
> 



Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-17 Thread aaron morton
> This would seem to conflict with the advice to only use secondary indexes on 
> fields with low cardinality, not high cardinality. I guess low cardinality is 
> good, as long as it isn't /too/ low? 
My concern is seeing people in the wild create secondary indexes with low 
cardinality that generate huge rows. 

Also with how selective indexes are, for background see "Create 
Highly-Selective Indexes" 
http://msdn.microsoft.com/en-nz/library/ms172984(v=sql.100).aspx

if you index 100 rows with a low cardinality, say there are only 10 unique 
values, then you have 10 index rows with 10 entries each. Using "Selectivity is 
the ratio of qualifying rows to total rows." from the article it's at 1:10 
ratio. If you have 50 unique values, you have 50 rows with 2 values each so the 
ratio is 1:50. The second is more selective and more useful. 
 
Indexing 20 million rows that all have "foo" == "bar" is not very useful. 

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/07/2013, at 10:52 AM, Tristan Seligmann  wrote:

> On Mon, Jul 15, 2013 at 12:26 AM, aaron morton  
> wrote:
>> Aaron Morton can confirm but I think one problem could be that to create an 
>> index on a field with small number of possible values is not good.
> Yes.
> In cassandra each value in the index becomes a single row in the internal 
> secondary index CF. You will end up with a huge row for all the values with 
> false. 
> 
> And in general, if you want a queue you should use a queue. 
> 
> This would seem to conflict with the advice to only use secondary indexes on 
> fields with low cardinality, not high cardinality. I guess low cardinality is 
> good, as long as it isn't /too/ low? 
> -- 
> mithrandi, i Ainil en-Balandor, a faer Ambar



Re: Huge query Cassandra limits

2013-07-17 Thread cesare cugnasco
Hi Rob,
of course we could issue multiple requests, but then we would need to consider
the optimal way to split the query into smaller ones. Moreover, we would need
to choose how many sub-queries to run in parallel.

In our tests, we found there is a significant performance difference
between various configurations and we are studying a policy to optimize
it. The doubt is that, if the need to issue multiple requests is
caused only by a fixable implementation detail, this study would be
pointless.

Has anyone done a similar analysis?


2013/7/16 Robert Coli 

>
> On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco <
> cesare.cugna...@gmail.com> wrote:
>
>> We  are working on porting some life science applications to Cassandra,
>> but we have to deal with its limits managing huge queries. Our queries are
>> usually multiget_slice ones: many rows with many columns each.
>>
>
> You are not getting much "win" by increasing request size in Cassandra,
> and you expose yourself to "lose" such as you have experienced.
>
> Is there some reason you cannot just issue multiple requests?
>
> =Rob
>


Re: Huge query Cassandra limits

2013-07-16 Thread Robert Coli
On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco
wrote:

> We  are working on porting some life science applications to Cassandra,
> but we have to deal with its limits managing huge queries. Our queries are
> usually multiget_slice ones: many rows with many columns each.
>

You are not getting much "win" by increasing request size in Cassandra, and
you expose yourself to "lose" such as you have experienced.

Is there some reason you cannot just issue multiple requests?

=Rob


Huge query Cassandra limits

2013-07-16 Thread cesare cugnasco
Hi everybody,

We are working on porting some life science applications to Cassandra, but
we have to deal with its limits managing huge queries. Our queries are
usually multiget_slice ones: many rows with many columns each.

We have seen the system slow down, until the entry point node crashes, as we
increase the number of rows and columns requested in a single query.

Where does this limit come from?

From a quick look at the code, it seems the entry point is stressed
because it has to keep all the responses in memory. Only after it has
received all the responses from the other nodes does it resolve the conflicts
between different versions and send them to the client.

Would it not be possible to start sending the response to the client before
receiving all of them? For instance, it might speed up consistency level
ONE queries.

Has anyone worked on this? Is it the real reason for the decrease in
performance?


Thank you,

Cesare Cugnasco
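
A client-agnostic sketch of the "smaller requests in parallel" approach suggested in the replies above; fetchChunk() is a placeholder for whatever multiget_slice or async driver call the client actually exposes, and the chunk size and parallelism are only starting points to tune:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedMultiget {
    // Split a huge multiget into fixed-size chunks fetched in parallel, so no
    // single coordinator has to buffer the whole result set in memory.
    static final int CHUNK_SIZE = 100;   // rows per request; tune empirically
    static final int PARALLELISM = 8;    // concurrent requests; tune empirically

    public static List<Object> fetchAll(List<String> keys) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
        try {
            List<Future<List<Object>>> futures = new ArrayList<Future<List<Object>>>();
            for (int i = 0; i < keys.size(); i += CHUNK_SIZE) {
                final List<String> chunk =
                        keys.subList(i, Math.min(i + CHUNK_SIZE, keys.size()));
                futures.add(pool.submit(new Callable<List<Object>>() {
                    public List<Object> call() throws Exception {
                        return fetchChunk(chunk); // placeholder for the real multiget
                    }
                }));
            }
            List<Object> results = new ArrayList<Object>();
            for (Future<List<Object>> f : futures) {
                results.addAll(f.get()); // gathers results, propagates failures
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder: issue one multiget_slice (or async driver query) for this chunk.
    static List<Object> fetchChunk(List<String> keys) {
        return new ArrayList<Object>();
    }
}
```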


Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-15 Thread Nate McCall
Couple of questions about the test setup:
- are you running the tests in parallel (via threadCount in surefire
or failsafe for example?)
- is the instance of cassandra per-class or per jvm? (or is fork=true?)


On Sun, Jul 14, 2013 at 5:52 PM, Tristan Seligmann
 wrote:
> On Mon, Jul 15, 2013 at 12:26 AM, aaron morton 
> wrote:
>>
>> Aaron Morton can confirm but I think one problem could be that to create
>> an index on a field with small number of possible values is not good.
>>
>> Yes.
>> In cassandra each value in the index becomes a single row in the internal
>> secondary index CF. You will end up with a huge row for all the values with
>> false.
>>
>> And in general, if you want a queue you should use a queue.
>
>
> This would seem to conflict with the advice to only use secondary indexes on
> fields with low cardinality, not high cardinality. I guess low cardinality
> is good, as long as it isn't /too/ low?
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar


Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-14 Thread Tristan Seligmann
On Mon, Jul 15, 2013 at 12:26 AM, aaron morton wrote:

> Aaron Morton can confirm but I think one problem could be that to create
> an index on a field with small number of possible values is not good.
>
> Yes.
> In cassandra each value in the index becomes a single row in the internal
> secondary index CF. You will end up with a huge row for all the values with
> false.
>
> And in general, if you want a queue you should use a queue.
>

This would seem to conflict with the advice to only use secondary indexes
on fields with low cardinality, not high cardinality. I guess low
cardinality is good, as long as it isn't /too/ low?
-- 
mithrandi, i Ainil en-Balandor, a faer Ambar


Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-14 Thread aaron morton
> Aaron Morton can confirm but I think one problem could be that to create an 
> index on a field with small number of possible values is not good.
Yes.
In cassandra each value in the index becomes a single row in the internal 
secondary index CF. You will end up with a huge row for all the values with 
false. 

And in general, if you want a queue you should use a queue. 

Cheers
 
-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/07/2013, at 1:51 AM, Shahab Yunus  wrote:

> Aaron Morton can confirm but I think one problem could be that to create an 
> index on a field with small number of possible values is not good.
> 
> Regards,
> Shahab
> 
> 
> On Sat, Jul 13, 2013 at 9:14 AM, Tristan Seligmann  
> wrote:
> On Fri, Jul 12, 2013 at 10:38 AM, aaron morton  
> wrote:
>> CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);
> On general this is a bad idea in Cassandra (also in a relational DB IMHO). 
> You will get poor performance from it. 
> 
> Could you elaborate on why this is a bad idea? 
> -- 
> mithrandi, i Ainil en-Balandor, a faer Ambar
> 



Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-13 Thread Shahab Yunus
Aaron Morton can confirm, but I think one problem could be that creating an
index on a field with a small number of possible values is not good.

Regards,
Shahab


On Sat, Jul 13, 2013 at 9:14 AM, Tristan Seligmann
wrote:

> On Fri, Jul 12, 2013 at 10:38 AM, aaron morton wrote:
>
>> CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);
>>
>> In general this is a bad idea in Cassandra (also in a relational DB
>> IMHO). You will get poor performance from it.
>>
>
> Could you elaborate on why this is a bad idea?
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar
>


Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-13 Thread Tristan Seligmann
On Fri, Jul 12, 2013 at 10:38 AM, aaron morton wrote:

> CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);
>
> In general this is a bad idea in Cassandra (also in a relational DB IMHO).
> You will get poor performance from it.
>

Could you elaborate on why this is a bad idea?
-- 
mithrandi, i Ainil en-Balandor, a faer Ambar


Re: IllegalArgumentException on query with AbstractCompositeType

2013-07-12 Thread aaron morton
> The “ALLOW FILTERING” clause also has no effect.
You only need that when the WHERE clause contains predicates for columns that 
are not part of the primary key. 
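
For example, against the table from the original post (a sketch with 
placeholder values; the exact rules vary somewhat between Cassandra versions):

-- No ALLOW FILTERING needed: both restricted columns are part of the primary key.
SELECT * FROM conv_msgdata_by_participant_cql
WHERE entityConversationId = 'conv-1'
  AND messageId < '2013-07-10T20:29:09.773Z';

-- ALLOW FILTERING is needed once a restriction falls outside the primary key,
-- e.g. the non-key, non-indexed msgReadDate column alongside the indexed flag:
SELECT * FROM conv_msgdata_by_participant_cql
WHERE msgReadFlag = false
  AND msgReadDate > '2013-07-01'
ALLOW FILTERING;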

> CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);
In general this is a bad idea in Cassandra (also in a relational DB IMHO). You 
will get poor performance from it. 

> Caused by: java.lang.IllegalArgumentException
> at java.nio.Buffer.limit(Buffer.java:247)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:78)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader$BlockFetcher.isColumnBeforeSliceFinish(IndexedSliceReader.java:216)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader$SimpleBlockFetcher.<init>(IndexedSliceReader.java:450)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:85)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
This looks like an error in the on-disk data, or maybe in the value being 
passed for messageId, but I doubt it. 

What version are you using? 
Can you reproduce this outside of your unit tests ?

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/07/2013, at 12:40 AM, "Pruner, Anne (Anne)"  wrote:

> Hi,
> I’ve been tearing my hair out trying to figure out why this 
> query fails.  In fact, it only fails on machines with slower CPUs and after 
> having previously run some other junit tests.  I’m running junits to an 
> embedded Cassandra server, which works well in pretty much all other cases, 
> but this one is flaky.  I’ve tried to rule out timing issues by placing a 10 
> second delay just before this query, just in case somehow the data isn’t 
> getting into the db in a timely manner, but that doesn’t have any effect.  
> I’ve also tried removing the “ORDER BY” clause, which seems to be the place 
> in the code it’s getting hung up on, but that also doesn’t have any effect.  
> The “ALLOW FILTERING” clause also has no effect.
>  
> DEBUG [Native-Transport-Requests:16] 2013-07-10 16:28:21,993 Message.java 
> (line 277) Received: QUERY SELECT * FROM conv_msgdata_by_participant_cql 
> WHERE entityConversationId='bulktestfromus...@test.ca&contact_811b5efc-b621-4361-9dc9-2e4755be7d89'
>  AND messageId<'2013-07-10T20:29:09.773Zzz' ORDER BY messageId DESC LIMIT 
> 15 ALLOW FILTERING;
> ERROR [ReadStage:34] 2013-07-10 16:28:21,995 CassandraDaemon.java (line 132) 
> Exception in thread Thread[ReadStage:34,5,main]
> java.lang.RuntimeException: java.lang.IllegalArgumentException
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1582)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.IllegalArgumentException
> at java.nio.Buffer.limit(Buffer.java:247)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:78)
> at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader$BlockFetcher.isColumnBeforeSliceFinish(IndexedSliceReader.java:216)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader$SimpleBlockFetcher.<init>(IndexedSliceReader.java:450)
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:85)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:44)
> at 
> org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:101)
> at 
> org.apache.cassandra.db.filter.QueryFilter.getS

IllegalArgumentException on query with AbstractCompositeType

2013-07-11 Thread Pruner, Anne (Anne)
Hi,
I've been tearing my hair out trying to figure out why this 
query fails.  In fact, it only fails on machines with slower CPUs and after 
having previously run some other JUnit tests.  I'm running JUnit tests against an 
embedded Cassandra server, which works well in pretty much all other cases, but 
this one is flaky.  I've tried to rule out timing issues by placing a 10 second 
delay just before this query, just in case somehow the data isn't getting into 
the db in a timely manner, but that doesn't have any effect.  I've also tried 
removing the "ORDER BY" clause, which seems to be the place in the code it's 
getting hung up on, but that also doesn't have any effect.  The "ALLOW 
FILTERING" clause also has no effect.

DEBUG [Native-Transport-Requests:16] 2013-07-10 16:28:21,993 Message.java (line 
277) Received: QUERY SELECT * FROM conv_msgdata_by_participant_cql WHERE 
entityConversationId='bulktestfromus...@test.ca&contact_811b5efc-b621-4361-9dc9-2e4755be7d89'
 AND messageId<'2013-07-10T20:29:09.773Zzz' ORDER BY messageId DESC LIMIT 
15 ALLOW FILTERING;
ERROR [ReadStage:34] 2013-07-10 16:28:21,995 CassandraDaemon.java (line 132) 
Exception in thread Thread[ReadStage:34,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1582)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:247)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:78)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
at 
org.apache.cassandra.db.columniterator.IndexedSliceReader$BlockFetcher.isColumnBeforeSliceFinish(IndexedSliceReader.java:216)
at 
org.apache.cassandra.db.columniterator.IndexedSliceReader$SimpleBlockFetcher.<init>(IndexedSliceReader.java:450)
at 
org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:85)
at 
org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
at 
org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:44)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:101)
at 
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:275)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
at org.apache.cassandra.db.Table.getRow(Table.java:355)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)

Here's the table it's querying from:

CREATE TABLE conv_msgdata_by_participant_cql (
entityConversationId text,
messageId text,
jsonMessage text,
msgReadFlag boolean,
msgReadDate text,
PRIMARY KEY (entityConversationId, messageId)
) ;

CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);


Any ideas?

Thanks,
Anne

