Re: Smart Table creation for 2D range query

2017-05-09 Thread Jim Ancona
There are clever ways to encode coordinates into a single scalar value
where points that are close on a surface are also close in value, making
queries efficient. Examples are Geohash
<https://en.wikipedia.org/wiki/Geohash> and Google's S2
<https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
As Jon mentions, this puts more work on the client, but might give you a
lot of querying flexibility when using Cassandra.
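
For instance, a minimal sketch of the geohash approach (the table, names, and
geohash values here are hypothetical; the client computes the geohash and
derives the prefix range for a bounding box):

CREATE TABLE points_by_geohash (
    geohash_prefix text,  -- first few characters of the geohash (partition key)
    geohash text,         -- full geohash (clustering column, kept sorted)
    item text,
    PRIMARY KEY (geohash_prefix, geohash, item)
);

-- points whose geohash starts with 'u4pruy' fall inside one small bounding box
SELECT * FROM points_by_geohash
WHERE geohash_prefix = 'u4pr'
  AND geohash >= 'u4pruy' AND geohash < 'u4pruz';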

Jim

On Mon, May 8, 2017 at 11:13 PM, Jon Haddad <jonathan.had...@gmail.com>
wrote:

> It gets a little tricky when you try to add in the coordinates to the
> clustering key if you want to do operations that are more complex.  For
> instance, finding all the elements within a radius of point (x,y) isn’t
> particularly fun with Cassandra.  I recommend moving that logic into the
> application.
>
> > On May 8, 2017, at 10:06 PM, kurt greaves <k...@instaclustr.com> wrote:
> >
> > Note that this will not give you the desired range queries of 0 <= x <= 1 and
> 0 <= y <= 1.
> >
> >
> > ​Something akin to Jon's solution could give you those range queries if
> you made the x and y components part of the clustering key.
> >
> > For example, a space of (1,1) could contain all x,y coordinates where x
> and y are > 0 and <= 1. You would then have a table like:
> >
> > CREATE TABLE geospatial (
> > space text,
> > x double,
> > y double,
> > item text,
> > m1 int,
> > m2 int,
> > m3 int,
> > m4 int,
> > m5 int,
> > primary key ((space), x, y, m1, m2, m3, m4, m5)
> > );
> >
> > A query of SELECT * FROM geospatial WHERE space = '1,1' AND x < 1 AND x > 0.5
> AND y < 0.2 AND y > 0.1; should yield all x and y pairs and their distinct
> metadata. Or something like that anyway.
> >


Re: Smart Table creation for 2D range query

2017-05-08 Thread Jon Haddad
It gets a little tricky when you try to add in the coordinates to the 
clustering key if you want to do operations that are more complex.  For 
instance, finding all the elements within a radius of point (x,y) isn’t 
particularly fun with Cassandra.  I recommend moving that logic into the 
application.  

> On May 8, 2017, at 10:06 PM, kurt greaves <k...@instaclustr.com> wrote:
> 
> Note that this will not give you the desired range queries of 0 <= x <= 1 and 0 <=
> y <= 1.
> 
> 
> ​Something akin to Jon's solution could give you those range queries if you 
> made the x and y components part of the clustering key.
> 
> For example, a space of (1,1) could contain all x,y coordinates where x and y 
> are > 0 and <= 1. You would then have a table like:
> 
> CREATE TABLE geospatial (
> space text,
> x double,
> y double,
> item text,
> m1 int,
> m2 int,
> m3 int,
> m4 int,
> m5 int,
> primary key ((space), x, y, m1, m2, m3, m4, m5)
> );
> 
> A query of SELECT * FROM geospatial WHERE space = '1,1' AND x < 1 AND x > 0.5 AND
> y < 0.2 AND y > 0.1; should yield all x and y pairs and their distinct metadata. Or
> something like that anyway.
> 





Re: Smart Table creation for 2D range query

2017-05-08 Thread kurt greaves
Note that this will not give you the desired range queries of 0 <= x <= 1 and 0
<= y <= 1.


​Something akin to Jon's solution could give you those range queries if you
made the x and y components part of the clustering key.

For example, a space of (1,1) could contain all x,y coordinates where x and
y are > 0 and <= 1. You would then have a table like:

CREATE TABLE geospatial (
space text,
x double,
y double,
item text,
m1 int,  -- m1..m5 are the metadata columns; int here is a placeholder type
m2 int,
m3 int,
m4 int,
m5 int,
primary key ((space), x, y, m1, m2, m3, m4, m5)
);

A query of SELECT * FROM geospatial WHERE space = '1,1' AND x < 1 AND x > 0.5 AND
y < 0.2 AND y > 0.1; should yield all x and y pairs and their distinct metadata. Or
something like that anyway.
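
As a concrete sketch against the table above: once x is range-restricted, CQL
only accepts the additional range on y with ALLOW FILTERING, which stays cheap
here because it filters within a single partition.

SELECT * FROM geospatial
WHERE space = '1,1'
  AND x > 0.5 AND x < 1
  AND y > 0.1 AND y < 0.2
ALLOW FILTERING;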


Re: Smart Table creation for 2D range query

2017-05-08 Thread Anthony Grasso
Hi Lydia,

Yes. This will define the *x*, *y* columns as the components of the
partition key. Note that by doing this, both *x* and *y* values will be
required at a minimum to perform a valid query.
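
For example, with PRIMARY KEY ((x, y), m1, ..., m5) as in the table below, a
quick sketch of what is and is not accepted:

-- valid: every partition key component is given
SELECT * FROM test WHERE x = 0.5 AND y = 0.75;

-- rejected: partition key components cannot be range-restricted
-- SELECT * FROM test WHERE x >= 0 AND x <= 1 AND y >= 0 AND y <= 1;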

Alternatively, the *x* and *y* values could be combined into a single
text field as Jon has suggested.

Kind regards,
Anthony

On 7 May 2017 at 17:15, Lydia Ickler <ickle...@googlemail.com> wrote:

> Like this?
>
> CREATE TABLE test (
>   x double,
>   y double,
>   m1 int,
>   ...
>   m5 int,
>   PRIMARY KEY ((x,y), m1, … , m5)
> )
>
>
>
> Am 05.05.2017 um 21:54 schrieb Nitan Kainth <ni...@bamlabs.com>:
>
> Make the metadata the partition key and x, y part of the primary key. It
> should work.
>
> Sent from my iPhone
>
> On May 5, 2017, at 2:40 PM, Lydia <ickle...@googlemail.com> wrote:
>
>
> Hi all,
>
>
> I am new to Apache Cassandra and I would like to get some advice on how to
> tackle table creation / indexing in a sophisticated way.
>
>
> My aim is to store x- and y-coordinates, accompanied by some columns with
> meta information (m1, ... ,m5). There will be around 100,000,000 rows
> overall. Some rows might have the same (x,y) pairs but always distinct meta
> information.
>
>
> In the end I want to do a rather simple range query in the form of e.g. (0
> <= x <= 1) AND (0 <= y <= 1).
>
>
> What would be the best choice of variables to set as the primary key or
> partition key? Or should I use an index? And if so, on what column(s)?
>
>
> Thanks in advance!
>
> Best regards,
>
> Lydia
>


Re: Smart Table creation for 2D range query

2017-05-07 Thread Lydia Ickler
Like this?

CREATE TABLE test (
  x double,
  y double,
  m1 int,
  ...
  m5 int,
  PRIMARY KEY ((x,y), m1, … , m5)
)


> Am 05.05.2017 um 21:54 schrieb Nitan Kainth <ni...@bamlabs.com>:
> 
> Make the metadata the partition key and x, y part of the primary key. It
> should work.
> 
> Sent from my iPhone
> 
>> On May 5, 2017, at 2:40 PM, Lydia <ickle...@googlemail.com> wrote:
>> 
>> Hi all,
>> 
>> I am new to Apache Cassandra and I would like to get some advice on how to
>> tackle table creation / indexing in a sophisticated way.
>> 
>> My aim is to store x- and y-coordinates, accompanied by some columns with 
>> meta information (m1, ... ,m5). There will be around 100,000,000 rows 
>> overall. Some rows might have the same (x,y) pairs but always distinct meta 
>> information. 
>> 
>> In the end I want to do a rather simple range query in the form of e.g. (0
>> <= x <= 1) AND (0 <= y <= 1).
>> 
>> What would be the best choice of variables to set as the primary key or partition
>> key? Or should I use an index? And if so, on what column(s)?
>> 
>> Thanks in advance!
>> Best regards, 
>> Lydia


Re: Smart Table creation for 2D range query

2017-05-05 Thread Jon Haddad
I think you’ll want to model your table similar to how an R-Tree [1] / Quad 
tree [2] works.  Let’s suppose you had a 10x10 meter land area and you wanted 
to put stuff in there.  In order to find “all the things in point x,y”, you 
could break your land area into a grid.  A partition would contain all the 
items that are in that grid space.  In my simple example, I’d have 100 
partitions.

For example:

// space is a simple "x.y" text field
CREATE TABLE geospatial (
space text,
item text,
primary key (space, item)
);

insert into geospatial (space, item) values ('1.1', 'hat');
insert into geospatial (space, item) values ('1.1', 'bird');
insert into geospatial (space, item) values ('6.4', 'dog');
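
Fetching everything in one grid cell is then a single-partition read, for example:

select item from geospatial where space = '6.4';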

This example is pretty trivial, and doesn’t take into account hot partitions.  
That’s where the process of subdividing a space occurs when it reaches a 
certain size.

[1] https://en.wikipedia.org/wiki/R-tree <https://en.wikipedia.org/wiki/R-tree>
[2] https://en.wikipedia.org/wiki/Quadtree 
<https://en.wikipedia.org/wiki/Quadtree>
> On May 5, 2017, at 12:54 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
> 
> Make the metadata the partition key and x, y part of the primary key. It
> should work.
> 
> Sent from my iPhone
> 
>> On May 5, 2017, at 2:40 PM, Lydia <ickle...@googlemail.com> wrote:
>> 
>> Hi all,
>> 
>> I am new to Apache Cassandra and I would like to get some advice on how to
>> tackle table creation / indexing in a sophisticated way.
>> 
>> My aim is to store x- and y-coordinates, accompanied by some columns with 
>> meta information (m1, ... ,m5). There will be around 100,000,000 rows 
>> overall. Some rows might have the same (x,y) pairs but always distinct meta 
>> information. 
>> 
>> In the end I want to do a rather simple range query in the form of e.g. (0
>> <= x <= 1) AND (0 <= y <= 1).
>> 
>> What would be the best choice of variables to set as the primary key or partition
>> key? Or should I use an index? And if so, on what column(s)?
>> 
>> Thanks in advance!
>> Best regards, 
>> Lydia



Re: Smart Table creation for 2D range query

2017-05-05 Thread Nitan Kainth
Make the metadata the partition key and x, y part of the primary key. It
should work.

Sent from my iPhone

> On May 5, 2017, at 2:40 PM, Lydia <ickle...@googlemail.com> wrote:
> 
> Hi all,
> 
> I am new to Apache Cassandra and I would like to get some advice on how to
> tackle table creation / indexing in a sophisticated way.
> 
> My aim is to store x- and y-coordinates, accompanied by some columns with 
> meta information (m1, ... ,m5). There will be around 100,000,000 rows 
> overall. Some rows might have the same (x,y) pairs but always distinct meta 
> information. 
> 
> In the end I want to do a rather simple range query in the form of e.g. (0 <=
> x <= 1) AND (0 <= y <= 1).
> 
> What would be the best choice of variables to set as the primary key or partition
> key? Or should I use an index? And if so, on what column(s)?
> 
> Thanks in advance!
> Best regards, 
> Lydia



Smart Table creation for 2D range query

2017-05-05 Thread Lydia
Hi all,

I am new to Apache Cassandra and I would like to get some advice on how to
tackle table creation / indexing in a sophisticated way.

My aim is to store x- and y-coordinates, accompanied by some columns with meta 
information (m1, ... ,m5). There will be around 100,000,000 rows overall. Some 
rows might have the same (x,y) pairs but always distinct meta information. 

In the end I want to do a rather simple range query in the form of e.g. (0 <= x
<= 1) AND (0 <= y <= 1).

What would be the best choice of variables to set as the primary key or partition
key? Or should I use an index? And if so, on what column(s)?

Thanks in advance!
Best regards, 
Lydia



Re: Will query on PK read entire partition?

2017-04-25 Thread Vladimir Yudovin
Hi,



if you provide the primary key, C* will not scan the whole partition, but will
use a bloom filter to determine which SSTables to read:

Cassandra uses Bloom filters to determine whether an SSTable has data for a
particular row. Bloom filters are unused for range scans, but are used for
index scans.

Best regards, Vladimir Yudovin,
Winguzone - Cloud Cassandra Hosting

 On Fri, 21 Apr 2017 07:56:08 -0400 Alain RODRIGUEZ arodr...@gmail.com wrote:




Hi Oskar,

My guess (wait for confirmation maybe): When you read from a primary key +
specific clustering key (or range of clustering keys), Apache Cassandra will
look for these specific values and not read all the row. Yet it is important to
know that a minimal block size of 64 KB is read from the disk (not configurable
in C* 2.0). Or if the table is compressed, the minimal read size is a chunk,
for which you can manually set the size. That's why when using small rows, it
is sometimes interesting to enable compression, even if you don't care about
the data size... This all has been improved a bit in 2.1 / 2.2 and greatly in
C* 3.0+.

I might write a post about this, if I do, I will let you know. It's an
interesting topic I have been working on recently.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-04-21 10:44 GMT+02:00 Oskar Kjellin oskar.kjel...@gmail.com:

If I have a table like this:

PRIMARY KEY ((userid),deviceid)

And I query
SELECT * FROM devices where userid= ? and deviceid = ?

Will cassandra read the entire partition for the userid? So if I have lots of
tombstones for userid, will they get scanned?

I guess this depends on how the bloomfilter is working. Does it contain the
partitioning key or the primary key?

We're using 2.0.17 if it matters.

/Oskar

Re: Will query on PK read entire partition?

2017-04-21 Thread Alain RODRIGUEZ
Hi Oskar,

My guess (wait for confirmation maybe): When you read from a primary key +
specific clustering key or (range of clustering keys), Apache Cassandra
will look for these specific values and not read all the row. Yet it is
important to know that a minimal block size of 64 KB is read from the disk
(not configurable in C* 2.0). Or if the table is compressed, the minimal
read size is a chunk, for which you can manually set the size. That's why
when using small rows, it is sometimes interesting to enable compression,
even if you don't care about the data size... This all has been improved a
bit in 2.1 / 2.2 and greatly in C* 3.0+.
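
On a 2.x table that could look like this (a sketch only; the table name and
the 4 KB value are illustrative):

ALTER TABLE ks.devices WITH compression =
    {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 4};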

I might write a post about this, if I do, I will let you know. It's an
interesting topic I have been working on recently.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2017-04-21 10:44 GMT+02:00 Oskar Kjellin <oskar.kjel...@gmail.com>:

> If I have a table like this:
>
> PRIMARY KEY ((userid),deviceid)
>
> And I query
> SELECT * FROM devices where userid= ? and deviceid = ?
>
> Will cassandra read the entire partition for the userid? So if I have lots of
> tombstones for userid, will they get scanned?
>
> I guess this depends on how the bloomfilter is working. Does it contain
> partitioning key or primary key?
>
> We're using 2.0.17 if it matters.
>
> /Oskar
>


Will query on PK read entire partition?

2017-04-21 Thread Oskar Kjellin
If I have a table like this:

PRIMARY KEY ((userid),deviceid)

And I query
SELECT * FROM devices where userid= ? and deviceid = ?

Will cassandra read the entire partition for the userid? So if I have lots of
tombstones for userid, will they get scanned?

I guess this depends on how the bloomfilter is working. Does it contain
partitioning key or primary key?

We're using 2.0.17 if it matters.

/Oskar


Re: Query on Data Modelling of a specific usecase

2017-04-20 Thread Naresh Yadav
Hi Jon,

Thanks for your guidance.

In the above-mentioned table I can have different scales depending on the report.

One report may have 1 rows.
Second report may have half million rows.
Third report may have 1 million rows.
Fourth report may have 10 million rows.

As this is time-series data, that was the main reason for modelling it in Cassandra.
We preferred a separate table for each report as there is no use case of
querying across reports, and light reports will also work faster.
I can plan to reduce the number of tables drastically by combining lighter reports
into one table at the application level.

Could you suggest an optimal table design for the mentioned queries, keeping
in mind one table at a scale of 10 million to 1 billion rows?

Thanks,
Naresh Yadav

On Wed, Apr 19, 2017 at 9:26 PM, Jon Haddad 
wrote:

> How much data do you plan to store in each table?
>
> I’ll be honest, this doesn’t sound like a Cassandra use case at first
> glance.  1 table per report x 1000 is going to be a bad time.  Odds are
> with different queries, you’ll need multiple views, so let's call that a
> handful of tables per report.  Sounds to me like you need CSV (for small
> reports) or Parquet + a file system (for large ones).
>
> Jon
>
>
> On Apr 18, 2017, at 11:34 PM, Naresh Yadav  wrote:
>
> Looking for cassandra expert's recommendation on above usecase, please
> reply.
>
> On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav 
> wrote:
>
>> Hi all,
>>
>> This is my existing table configured on apache-cassandra-3.0.9:
>>
>> CREATE TABLE report_id1 (
>>    mc_id text,
>>    tag_id text,
>>    e_date timestamp,
>>    value text,
>>    PRIMARY KEY ((mc_id, tag_id), e_date)
>> );
>>
>> I create table dynamically for each report from application. Need to
>> support upto 1000 reports means 1000 such tables.
>> unique mc_id will be in range of 5 to 100 in a report.
>> For a mc_id there will be unique tag_id in range of 100 to 1 million in a
>> report.
>> For a mc_id, tag_id there will be unique e_date values in range of 10 to
>> 5000.
>>
>> Current queries to answer :
>> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
>> e_date='16Apr2017 23:59:59';
>> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
>> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
>>
>> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017
>> 23:59:59';
>>Current design this works with ALLOW FILTERING ONLY
>> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
>> 00:00:00' AND e_date <='16Apr2017 23:59:59';
>>Current design this works with ALLOW FILTERING ONLY
>>
>> Looking for better design for this case, keeping in mind dynamic tables
>> usecase and queries listed.
>>
>> Thanks in advance,
>> Naresh
>>
>>
>
>


Re: Query on Data Modelling of a specific usecase

2017-04-19 Thread Jon Haddad
How much data do you plan to store in each table?

I’ll be honest, this doesn’t sound like a Cassandra use case at first glance.  
1 table per report x 1000 is going to be a bad time.  Odds are with different 
queries, you’ll need multiple views, so let's call that a handful of tables per 
report.  Sounds to me like you need CSV (for small reports) or Parquet + a file 
system (for large ones).

Jon


> On Apr 18, 2017, at 11:34 PM, Naresh Yadav  wrote:
> 
> Looking for cassandra expert's recommendation on above usecase, please reply.
> 
> On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav  > wrote:
> Hi all,
> 
> This is my existing table configured on apache-cassandra-3.0.9:
> 
> CREATE TABLE report_id1 (
>    mc_id text,
>    tag_id text,
>    e_date timestamp,
>    value text,
>    PRIMARY KEY ((mc_id, tag_id), e_date)
> );
> 
> I create table dynamically for each report from application. Need to support 
> upto 1000 reports means 1000 such tables.
> unique mc_id will be in range of 5 to 100 in a report.
> For a mc_id there will be unique tag_id in range of 100 to 1 million in a 
> report.
> For a mc_id, tag_id there will be unique e_date values in range of 10 to 5000.
> 
> Current queries to answer : 
> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND 
> e_date='16Apr2017 23:59:59';
> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND 
> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
> 
> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017 00:00:00' 
> AND e_date <='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
>
> Looking for better design for this case, keeping in mind dynamic tables 
> usecase and queries listed.   
> 
> Thanks in advance,
> Naresh
> 
> 



Re: Query on Data Modelling of a specific usecase

2017-04-19 Thread Naresh Yadav
Looking for cassandra expert's recommendation on above usecase, please
reply.

On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav  wrote:

> Hi all,
>
> This is my existing table configured on apache-cassandra-3.0.9:
>
> CREATE TABLE report_id1 (
>    mc_id text,
>    tag_id text,
>    e_date timestamp,
>    value text,
>    PRIMARY KEY ((mc_id, tag_id), e_date)
> );
>
> I create table dynamically for each report from application. Need to
> support upto 1000 reports means 1000 such tables.
> unique mc_id will be in range of 5 to 100 in a report.
> For a mc_id there will be unique tag_id in range of 100 to 1 million in a
> report.
> For a mc_id, tag_id there will be unique e_date values in range of 10 to
> 5000.
>
> Current queries to answer :
> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
> e_date='16Apr2017 23:59:59';
> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
>
> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
> 00:00:00' AND e_date <='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
>
> Looking for better design for this case, keeping in mind dynamic tables
> usecase and queries listed.
>
> Thanks in advance,
> Naresh
>
>


Query on Data Modelling of a specific usecase

2017-04-17 Thread Naresh Yadav
Hi all,

This is my existing table configured on apache-cassandra-3.0.9:

CREATE TABLE report_id1 (
   mc_id text,
   tag_id text,
   e_date timestamp,
   value text,
   PRIMARY KEY ((mc_id, tag_id), e_date)
);

I create a table dynamically for each report from the application. I need to
support up to 1000 reports, which means 1000 such tables.
Unique mc_id values will be in the range of 5 to 100 in a report.
For an mc_id there will be unique tag_id values in the range of 100 to 1 million
in a report.
For an mc_id, tag_id pair there will be unique e_date values in the range of 10
to 5000.

Current queries to answer :
1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
e_date='16Apr2017 23:59:59';
2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;

3) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
   With the current design this only works with ALLOW FILTERING.
4) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
00:00:00' AND e_date <='16Apr2017 23:59:59';
   With the current design this only works with ALLOW FILTERING.

Looking for better design for this case, keeping in mind dynamic tables
usecase and queries listed.
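
For what it's worth, one common pattern for queries 3 and 4 is a second,
query-specific table (a sketch only; a real design would likely also need a
time bucket in the partition key, since a single mc_id could otherwise
accumulate a very large partition at this scale):

CREATE TABLE report_id1_by_date (
    mc_id text,
    e_date timestamp,
    tag_id text,
    value text,
    PRIMARY KEY ((mc_id), e_date, tag_id)
);

SELECT * FROM report_id1_by_date
WHERE mc_id = 'x'
  AND e_date >= '2017-04-01 00:00:00'
  AND e_date <= '2017-04-16 23:59:59';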

Thanks in advance,
Naresh


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
Apologies for the stream-of-consciousness replies, but are the dropped
message stats output by tpstats an accumulation since the node came up, or
are there processes which clear and/or time-out the info?

On Mon, Mar 20, 2017 at 3:18 PM, Voytek Jarnot <voytek.jar...@gmail.com>
wrote:

> No dropped messages in tpstats on any of the nodes.
>
> On Mon, Mar 20, 2017 at 3:11 PM, Voytek Jarnot <voytek.jar...@gmail.com>
> wrote:
>
>> Appreciate the reply, Kurt.
>>
>> I sanitized it out of the traces, but all trace outputs listed the same
>> node for all three queries (1 working, 2 not working). Read repair chance
>> set to 0.0 as recommended when using TWCS.
>>
>> I'll check tpstats - in this environment, load is not an issue, but
>> network issues may be.
>>
>> On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves <k...@instaclustr.com>
>> wrote:
>>
>>> As secondary indexes are stored individually on each node, what you're
>>> suggesting sounds exactly like a consistency issue. The fact that you read
>>> 0 cells on one query implies the node that got the query did not have any
>>> data for the row. The reason you would sometimes see different behaviours
>>> is likely because of read repairs. The fact that the repair fixed the
>>> issue pretty much guarantees it's a consistency issue.
>>>
>>> You should check for dropped mutations in tpstats/logs and if they are
>>> occurring try and stop that from happening (probably load related). You
>>> could also try performing reads and writes at LOCAL_QUORUM for stronger
>>> consistency, however note this has a performance/latency impact.
>>>
>>>
>>>
>>
>


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
No dropped messages in tpstats on any of the nodes.

On Mon, Mar 20, 2017 at 3:11 PM, Voytek Jarnot <voytek.jar...@gmail.com>
wrote:

> Appreciate the reply, Kurt.
>
> I sanitized it out of the traces, but all trace outputs listed the same
> node for all three queries (1 working, 2 not working). Read repair chance
> set to 0.0 as recommended when using TWCS.
>
> I'll check tpstats - in this environment, load is not an issue, but
> network issues may be.
>
> On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> As secondary indexes are stored individually on each node, what you're
>> suggesting sounds exactly like a consistency issue. The fact that you read
>> 0 cells on one query implies the node that got the query did not have any
>> data for the row. The reason you would sometimes see different behaviours
>> is likely because of read repairs. The fact that the repair fixed the
>> issue pretty much guarantees it's a consistency issue.
>>
>> You should check for dropped mutations in tpstats/logs and if they are
>> occurring try and stop that from happening (probably load related). You
>> could also try performing reads and writes at LOCAL_QUORUM for stronger
>> consistency, however note this has a performance/latency impact.
>>
>>
>>
>


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
Appreciate the reply, Kurt.

I sanitized it out of the traces, but all trace outputs listed the same
node for all three queries (1 working, 2 not working). Read repair chance
set to 0.0 as recommended when using TWCS.

I'll check tpstats - in this environment, load is not an issue, but network
issues may be.

On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves <k...@instaclustr.com> wrote:

> As secondary indexes are stored individually on each node, what you're
> suggesting sounds exactly like a consistency issue. The fact that you read
> 0 cells on one query implies the node that got the query did not have any
> data for the row. The reason you would sometimes see different behaviours
> is likely because of read repairs. The fact that the repair fixed the
> issue pretty much guarantees it's a consistency issue.
>
> You should check for dropped mutations in tpstats/logs and if they are
> occurring try and stop that from happening (probably load related). You
> could also try performing reads and writes at LOCAL_QUORUM for stronger
> consistency, however note this has a performance/latency impact.
>
>
>


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread kurt greaves
As secondary indexes are stored individually on each node, what you're
suggesting sounds exactly like a consistency issue. The fact that you read
0 cells on one query implies the node that got the query did not have any
data for the row. The reason you would sometimes see different behaviours
is likely because of read repairs. The fact that the repair fixed the
issue pretty much guarantees it's a consistency issue.

You should check for dropped mutations in tpstats/logs and if they are
occurring try and stop that from happening (probably load related). You
could also try performing reads and writes at LOCAL_QUORUM for stronger
consistency, however note this has a performance/latency impact.
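
In cqlsh, for example (the drivers expose an equivalent per-session or
per-statement setting):

CONSISTENCY LOCAL_QUORUM;
SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11;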


Re: Very odd & inconsistent results from SASI query

2017-03-17 Thread Voytek Jarnot
A wrinkle further confounds the issue: running a repair on the node which
was servicing the queries has cleared things up and all the queries now
work.

That doesn't make a whole lot of sense to me - my assumption was that a
repair shouldn't have fixed it.

On Fri, Mar 17, 2017 at 12:03 PM, Voytek Jarnot <voytek.jar...@gmail.com>
wrote:

> Cassandra 3.9, 4 nodes, rf=3
>
> Hi folks, we're seeing 0 results returned from queries that (a) should return
> results, and (b) will return results with minor tweaks.
>
> I've attached the sanitized trace outputs for the following 3 queries (pk1
> and pk2 are partition keys, ck1 is clustering key, val1 is SASI indexed
> non-key column):
>
> Q1: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
> '2017-03-16' AND ck1 <= '2017-03-17'  AND val1 LIKE 'abcdefgh%'  LIMIT 1001
> ALLOW FILTERING;
> Q1 works - it returns a list of records, one of which has
> val1='abcdefghijklmn'.
>
> Q2: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
> '2017-03-16' AND ck1 <= '2017-03-17'  AND val1 LIKE 'abcdefghi%'  LIMIT
> 1001 ALLOW FILTERING;
> Q2 does not work - 0 results returned. Only difference to Q1 is one
> additional character provided in LIKE comparison.
>
> Q3: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
> '2017-03-16' AND ck1 <= '2017-03-17'  AND val1 = 'abcdefghijklmn'  LIMIT
> 1001 ALLOW FILTERING;
> Q3 does not work - 0 results returned.
>
> As I've written above, the data set *does* include a record with
> val1='abcdefghijklmn'.
>
> Confounding the issue is that this behavior is inconsistent.  For
> different values of val1, I'll have scenarios where Q3 works, but Q1 and Q2
> do not. Now, that particular behavior I could explain with index/like
> problems, but it is Q3 that sometimes does not work and that's a simple
> equality comparison (although still using the index).
>
> Further confounding the issue is that if my testers run these same queries
> with the same parameters tomorrow, they're likely to work correctly.
>
> Only thing I've been able to glean from tracing execution is that the
> queries that work follow "Executing read..." with "Executing single
> partition query on t1" and so forth,  whereas the queries that don't work
> simply follow "Executing read..." with "Read 0 live and 0 tombstone cells"
> with no actual work seemingly done. But that's not helping me narrow this
> down much.
>
> Thanks for your time - appreciate any help.
>


Very odd & inconsistent results from SASI query

2017-03-17 Thread Voytek Jarnot
Cassandra 3.9, 4 nodes, rf=3

Hi folks, we're seeing 0 results returned from queries that (a) should return
results, and (b) will return results with minor tweaks.

I've attached the sanitized trace outputs for the following 3 queries (pk1
and pk2 are partition keys, ck1 is clustering key, val1 is SASI indexed
non-key column):

Q1: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
'2017-03-16' AND ck1 <= '2017-03-17'  AND val1 LIKE 'abcdefgh%'  LIMIT 1001
ALLOW FILTERING;
Q1 works - it returns a list of records, one of which has
val1='abcdefghijklmn'.

Q2: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
'2017-03-16' AND ck1 <= '2017-03-17'  AND val1 LIKE 'abcdefghi%'  LIMIT
1001 ALLOW FILTERING;
Q2 does not work - 0 results returned. Only difference to Q1 is one
additional character provided in LIKE comparison.

Q3: SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >=
'2017-03-16' AND ck1 <= '2017-03-17'  AND val1 = 'abcdefghijklmn'  LIMIT
1001 ALLOW FILTERING;
Q3 does not work - 0 results returned.

As I've written above, the data set *does* include a record with
val1='abcdefghijklmn'.

Confounding the issue is that this behavior is inconsistent.  For different
values of val1, I'll have scenarios where Q3 works, but Q1 and Q2 do not.
Now, that particular behavior I could explain with index/like problems, but
it is Q3 that sometimes does not work and that's a simple equality
comparison (although still using the index).

Further confounding the issue is that if my testers run these same queries
with the same parameters tomorrow, they're likely to work correctly.

Only thing I've been able to glean from tracing execution is that the
queries that work follow "Executing read..." with "Executing single
partition query on t1" and so forth,  whereas the queries that don't work
simply follow "Executing read..." with "Read 0 live and 0 tombstone cells"
with no actual work seemingly done. But that's not helping me narrow this
down much.

Thanks for your time - appreciate any help.
Trace output (sanitized) for the query that returns results (including the
record where val1='abcdefghijklmn'):

 Parsing SELECT * FROM t1 WHERE pk1 = 2017 AND pk2 = 11  AND  ck1 >= '2017-03-16' AND ck1 <= '2017-03-17'  AND val1 LIKE 'abcdefgh%'  LIMIT 1001 ALLOW FILTERING; [Native-Transport-Requests-1]
 Preparing statement [Native-Transport-Requests-1]
 Index mean cardinalities are idx_my_idx:-9223372036854775808. Scanning with idx_my_idx. [Native-Transport-Requests-1]
 Computing ranges to query [Native-Transport-Requests-1]
 Submitting range requests on 1 ranges with a concurrency of 1 (-1.08086395E16 rows per range expected) [Native-Transport-Requests-1]
 Submitted 1 concurrent range requests [Native-Transport-Requests-1]
 Executing read on keyspace.t1 using index idx_my_idx [ReadStage-2]
 Executing single-partition query on t1 [ReadStage-2]
 Acquiring sstable references [ReadStage-2]
 Key cache hit for sstable 2223 [ReadStage-2]
 Skipped 34/35 non-slice-intersecting sstables, included 1 due to tombstones [ReadStage-2]
 Key cache hit for sstable 2221 [ReadStage-2]
 Merged data from memtables and 2 sstables [ReadStage-2]

Re: Trouble implementing CAS operation with LWT query

2017-02-22 Thread Edward Capriolo
On Wed, Feb 22, 2017 at 8:42 AM, 안정아 <jungah@samsung.com> wrote:

> Hi, all
>
>
>
> I'm trying to implement a typical CAS operation with LWT query(conditional
> update).
>
> But I'm having trouble keeping integrity of the result when
> WriteTimeoutException occurs.
>
> according to http://www.datastax.com/dev/blog/cassandra-error-handling-
> done-right
>
> "If the paxos phase fails, the driver will throw a WriteTimeoutException
> with a WriteType.
>
> CAS as retrieved with WriteTimeoutException#getWriteType().
>
> In this situation you can’t know if the CAS operation has been applied..."
>
> 1) Doesn't it ruin the whole point of using LWT for a CAS operation if you
> can't be sure whether the query was applied or not?
>
> 2-1) Is there any way to know whether the query was applied when the timeout
> occurred?
>
> 2-2) If we can't tell, is there any way to work around this and keep the
> CAS integrity?
>
>
>
> Thanks!
>
>
>
>
>

The first thing you might try is to count the timeouts:

https://github.com/edwardcapriolo/ec/blob/master/src/test/java/Base/CompareAndSwapTest.java

https://github.com/edwardcapriolo/ec/blob/master/src/test/java/Base/CompareAndSwapTest.java#L99

This tactic does not work.

However, you can keep re-reading at CL.SERIAL to determine if the update
applied.
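
In cqlsh terms, a sketch (table and column names are hypothetical):

CONSISTENCY SERIAL;
SELECT owner FROM locks WHERE resource = 'r1';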

What I found this to mean is you CAN'T do this:

for (int i = 0; i < 2000; i++) {
    new Thread(() -> doCasInsert()).start();
}

Assert.assertEquals(2000, getTotalInserts());

But you CAN do this:

for (int i = 0; i < 2000; i++) {
    new Thread(() -> {
        // re-read at ConsistencyLevel.SERIAL to observe any applied CAS writes;
        // selectCountAtSerial() stands in for a "SELECT count(*) ..." query
        long count = selectCountAtSerial();
        if (count < 2000) {
            doCasInsert();
        }
    }).start();
}


Essentially, because a CAS operation that times out on the client may still
succeed in the future, you can not "COUNT" on the insert side; you have to
verify with a read at SERIAL.


Trouble implementing CAS operation with LWT query

2017-02-22 Thread 안정아


Hi, all
 
I'm trying to implement a typical CAS operation with an LWT query (conditional update).
But I'm having trouble keeping the integrity of the result when a WriteTimeoutException occurs.
According to http://www.datastax.com/dev/blog/cassandra-error-handling-done-right:
"If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.
CAS as retrieved with WriteTimeoutException#getWriteType().
In this situation you can’t know if the CAS operation has been applied..."
1) Doesn't it ruin the whole point of using LWT for a CAS operation if you can't be sure whether the query was applied or not?
2-1) Is there any way to know whether the query was applied when the timeout occurred?
2-2) If we can't tell, is there any way to work around this and keep the CAS integrity?
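
For reference, a typical conditional update looks like this (a sketch with a
hypothetical table and columns):

UPDATE accounts SET balance = 90
WHERE id = 1
IF balance = 100;  -- applied only if the condition holds; runs a Paxos round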
 
Thanks! 
 


Re: is there a query to find out the largest partition in a table?

2017-02-18 Thread Kant Kodali
*I did the following. Now I wonder if this is one node or multiple nodes?
Does this value really tell me I have a large partition?*

nodetool cfhistograms test hello // This reports the max partition size is
10GB

nodetool tablestats test.hello // This also reports Compacted partition
maximum bytes: 10299432635


Percentile  SSTables  Write Latency  Read Latency   Partition Size   Cell Count
                          (micros)      (micros)          (bytes)
50%             0.00         20.50         51.01        155469300        654949
75%             0.00         24.60         88.15       4139110981      17436917
95%             6.00         29.52     155469.30      10299432635      43388628
98%             6.00         42.51     668489.53      10299432635      43388628
99%             6.00         61.21     802187.44      10299432635      43388628
Min             0.00          5.72          9.89              125             5
Max             6.00     668489.53    8582860.53      10299432635      43388628
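
Note that system.size_estimates is node-local (each node only reports the
token ranges it owns) and holds mean sizes per range, so neither query is
guaranteed to surface the single largest partition cluster-wide. A sketch of
narrowing it to one table on one node:

SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'test' AND table_name = 'hello';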

On Sat, Feb 18, 2017 at 12:28 AM, Kant Kodali <k...@peernova.com> wrote:

> is there a query to find out the largest partition in a table? Does the
> query below give me the largest partition?
>
> select max(mean_partition_size) from size_estimates ;
>
> Thanks,
> Kant
>


is there a query to find out the largest partition in a table?

2017-02-18 Thread Kant Kodali
is there a query to find out the largest partition in a table? Does the
query below give me the largest partition?

select max(mean_partition_size) from size_estimates ;

Thanks,
Kant


RE: Query on Cassandra clusters

2017-01-03 Thread SEAN_R_DURITY
A couple thoughts (for after you up/downgrade to one version for all nodes):

- 16 GB of total RAM on a node is a minimum I would use; 32 would be much
better.

- With a lower amount of memory, I think I would keep memtables on-heap in
order to keep a tighter rein on how much they use. If you are consistently
using 75% or more of heap space, you need more (either more nodes or more
memory per node).

- I would try giving Cassandra 50% of the RAM on the host, and remove any
client or non-Cassandra processes. Nodes should be dedicated to Cassandra
(for Production).

- For disk, my rule for size-tiered is that you need 50% overhead IF it is
primarily a single-table application (90%+ of data in one table). Otherwise,
I am OK with 35-40% overhead. Just know you can hit issues down the road as
the sstables get larger.


Sean Durity
From: Sumit Anvekar [mailto:sumit.anve...@gmail.com]
Sent: Wednesday, December 21, 2016 3:47 PM
To: user@cassandra.apache.org
Subject: Re: Query on Cassandra clusters

Thank you Alain for the detailed explanation.
To answer your question on Java version, JVM settings and memory usage: we are
using 1.8.0_45, precisely:
>java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
JVM settings are identical on all nodes (cassandra-env.sh is identical).
Further, when I say high memory usage: Cassandra is using heap (-Xmx3767M) and
off-heap of about 6 GB out of the total system memory of 14.7 GB. Along with
this there are other processes running on this system, which brings the overall
memory usage to >95%. This brings me to another point: is heap memory + off-heap
(the sum of the Space used (total) values from nodetool cfstats) the total
memory used by Cassandra on a node?
Also, on the disk front, what is a good amount of empty space to leave unused
in the partition (~50% should be?) considering we use the SizeTieredCompaction
strategy?

On Wed, Dec 21, 2016 at 6:30 PM, Alain RODRIGUEZ 
<arodr...@gmail.com<mailto:arodr...@gmail.com>> wrote:
Hi Sumit,

1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra version 
3.0.3 and then newer 5 nodes have 3.6.0 version.

I strongly recommend to:


  *   Stick with one version of Apache Cassandra per cluster.
  *   Always be as close as possible from the last minor release of the 
Cassandra version in use.

So you really should not be using 3.0.6 AND 3.6.0 but rather 3.0.10 OR 3.7 
currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock release cycle
where odd numbers are bug-fix only and even numbers introduce new features as well.

Running multiple versions for a long period can induce errors; Cassandra is
built to handle multiple versions only to give operators time to run a rolling
restart. No streaming (adding / removing / repairing nodes) should happen
during this period. Also, I have seen in the past some cases where changing the
schema was also an issue with multiple versions, leading to schema
disagreements.

Due to this scenario, a couple boxes are running very high on memory (95% 
usage) whereas some of the older version nodes have just 60-70% memory usage.

Hard to say if this is related to the multiple versions of Cassandra, but it
could be. Are you sure the nodes are using the same JVM / GC options
(cassandra-env.sh) and Java version?

Also, what exactly is "high on memory (95%)"? Are we talking about heap or
native memory? Isn't the memory used as page cache (which would still be
available to the system)?

2. To counter #1, I am planning to upgrade system configuration of the nodes 
where there is higher memory usage. But the question is, will it be a problem 
if we have a Cassandra cluster, where in a couple of nodes have double the 
system configuration than other nodes in the cluster.

It is not a problem per se to have distinct configurations on distinct nodes. 
Cassandra does it very well, and it is frequently used to test some 
configuration change on a canary node, to prevent it from impacting the whole 
service.

Yet, all the nodes should be doing the same work (unless you have some
heterogeneous hardware and are using a distinct number of vnodes on each node).
Keeping things homogeneous allows the operator to easily compare how nodes are
doing, and it makes reasoning about Cassandra, as well as troubleshooting
issues, way easier.

So I would:

- Fully upgrade / downgrade asap to a chosen version (3.X is known as being not 
yet stable, but going back to 3.0.X might be more painful)
- Make sure nodes are well balanced and using the same number of ranges
('nodetool status').
- Make sure the nodes are using the same Java version and JVM settings.

Hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - 
al...@thelastpickle.com<mailto:al...@thelastpickle.com>
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Query

2016-12-30 Thread Work
Actually, "noSQL" is a misleading misnomer. With C* you have CQL which is 
adapted from SQL syntax and purpose.

For a poster boy, try Netflix.

Regards,

James 

Sent from my iPhone

> On Dec 30, 2016, at 4:59 AM, Sikander Rafiq <hafiz_ra...@hotmail.com> wrote:
> 
> Thanks for your comments/suggestions.
> 
> 
> Yes, I understand my project's needs and requirements. Surely it requires
> handling huge data, which is why I'm exploring what suits it.
> 
> 
> Though Cassandra is distributed, scalable and highly available, it is
> NoSQL, meaning the SQL part is missing and needs to be handled.
> 
> 
> 
> Can anyone please tell me some big names who are using Cassandra for
> handling huge data sets, like Twitter etc.?
> 
> 
> 
> Sent from Outlook
> 
> 
>  
> From: Edward Capriolo <edlinuxg...@gmail.com>
> Sent: Friday, December 30, 2016 5:53 AM
> To: user@cassandra.apache.org
> Subject: Re: Query
>  
> You should start with understanding your needs. Once you understand your need 
> you can pick the software that fits your need. Starting with a software stack 
> is backwards.
> 
>> On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater <ben.sla...@instaclustr.com> 
>> wrote:
>> I wasn’t familiar with Gizzard either so I thought I’d take a look. The 
>> first things on their github readme is:
>> NB: This project is currently not recommended as a base for new consumers.
>> (And no commits since 2013)
>> 
>> So, Cassandra definitely looks like a better choice as your datastore for a 
>> new project.
>> 
>> Cheers
>> Ben
>> 
>>> On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar <khangaon...@gmail.com> 
>>> wrote:
>>> I am not that familiar with gizzard but with gizzard + mysql, you have
>>> multiple moving parts in the system that need to be managed separately. You'll
>>> need the mysql expert for mysql and the gizzard expert to manage the
>>> distributed part. It can be argued that long term this will have higher
>>> administration cost.
>>> 
>>> Cassandra's value add is its simple peer to peer architecture that is easy 
>>> to manage - a single database solution that is distributed, scalable, 
>>> highly available etc. In other words, once you gain expertise cassandra, 
>>> you get everything in one package.
>>> 
>>> regards
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq <hafiz_ra...@hotmail.com> 
>>> wrote:
>>> Hi,
>>> 
>>> I'm exploring Cassandra for handling large data sets for mobile app, but 
>>> i'm not clear where it stands.
>>> 
>>> 
>>> If we use MySQL as  underlying database and Gizzard for building custom 
>>> distributed databases (with arbitrary storage technology) and Memcached for 
>>> highly queried data, then where lies Cassandra?
>>> 
>>> 
>>> 
>>> As i have read that Twitter uses both Cassandra and Gizzard. Please explain 
>>> me where Cassandra will act.
>>> 
>>> 
>>> Thanks in advance.
>>> 
>>> 
>>> Regards,
>>> 
>>> Sikander
>>> 
>>> 
>>> 
>>> Sent from Outlook
>>> 
>>> 
>>> 
>>> -- 
>>> http://khangaonkar.blogspot.com/
> 


RE: Query

2016-12-30 Thread SEAN_R_DURITY
A few of the many companies that rely on Cassandra are mentioned here:
http://cassandra.apache.org
Apple, Netflix, Weather Channel, etc.
(Not nearly as good as the Planet Cassandra list that DataStax used to 
maintain. Boo for the Apache/DataStax squabble!)

DataStax has a list of many case studies, too, with their enterprise version of 
Cassandra:
http://www.datastax.com/resources/casestudies


Sean Durity

From: Sikander Rafiq [mailto:hafiz_ra...@hotmail.com]
Sent: Friday, December 30, 2016 8:00 AM
To: user@cassandra.apache.org
Subject: Re: Query


Thanks for your comments/suggestions.



Yes, I understand my project's needs and requirements. Surely it requires
handling huge data, which is why I'm exploring what suits it.



Though Cassandra is distributed, scalable and highly available, it is NoSQL,
meaning the SQL part is missing and needs to be handled.



Can anyone please tell me some big names who are using Cassandra for handling
huge data sets, like Twitter etc.?





Sent from Outlook<http://aka.ms/weboutlook>


From: Edward Capriolo <edlinuxg...@gmail.com<mailto:edlinuxg...@gmail.com>>
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Query

You should start with understanding your needs. Once you understand your need 
you can pick the software that fits your need. Starting with a software stack is
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
<ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>> wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
things on their github readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
<khangaon...@gmail.com<mailto:khangaon...@gmail.com>> wrote:
I am not that familiar with gizzard but with gizzard + mysql, you have
multiple moving parts in the system that need to be managed separately. You'll
need the mysql expert for mysql and the gizzard expert to manage the
distributed part. It can be argued that long term this will have higher
administration cost.
Cassandra's value add is its simple peer to peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available etc. In other words, once you gain expertise cassandra, you get 
everything in one package.
regards




On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
<hafiz_ra...@hotmail.com<mailto:hafiz_ra...@hotmail.com>> wrote:

Hi,

I'm exploring Cassandra for handling large data sets for mobile app, but i'm 
not clear where it stands.



If we use MySQL as  underlying database and Gizzard for building custom 
distributed databases (with arbitrary storage technology) and Memcached for 
highly queried data, then where lies Cassandra?



As i have read that Twitter uses both Cassandra and Gizzard. Please explain me 
where Cassandra will act.



Thanks in advance.



Regards,

Sikander




Sent from Outlook<http://aka.ms/weboutlook>


--
http://khangaonkar.blogspot.com/





Re: Query

2016-12-30 Thread Sikander Rafiq
Thanks for your comments/suggestions.


Yes, I understand my project's needs and requirements. Surely it requires
handling huge data, which is why I'm exploring what suits it.


Though Cassandra is distributed, scalable and highly available, it is NoSQL,
meaning the SQL part is missing and needs to be handled.


Can anyone please tell me some big names who are using Cassandra for handling
huge data sets, like Twitter etc.?



Sent from Outlook<http://aka.ms/weboutlook>



From: Edward Capriolo <edlinuxg...@gmail.com>
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org
Subject: Re: Query

You should start with understanding your needs. Once you understand your need 
you can pick the software that fits your need. Starting with a software stack is
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
<ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>> wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
things on their github readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
<khangaon...@gmail.com<mailto:khangaon...@gmail.com>> wrote:
I am not that familiar with gizzard but with gizzard + mysql, you have
multiple moving parts in the system that need to be managed separately. You'll
need the mysql expert for mysql and the gizzard expert to manage the
distributed part. It can be argued that long term this will have higher
administration cost.

Cassandra's value add is its simple peer to peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available etc. In other words, once you gain expertise cassandra, you get 
everything in one package.

regards





On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
<hafiz_ra...@hotmail.com<mailto:hafiz_ra...@hotmail.com>> wrote:

Hi,

I'm exploring Cassandra for handling large data sets for mobile app, but i'm 
not clear where it stands.


If we use MySQL as  underlying database and Gizzard for building custom 
distributed databases (with arbitrary storage technology) and Memcached for 
highly queried data, then where lies Cassandra?


As i have read that Twitter uses both Cassandra and Gizzard. Please explain me 
where Cassandra will act.


Thanks in advance.


Regards,

Sikander



Sent from Outlook<http://aka.ms/weboutlook>



--
http://khangaonkar.blogspot.com/



Re: Query

2016-12-29 Thread Edward Capriolo
You should start with understanding your needs. Once you understand your
need you can pick the software that fits your need. Starting with a software
stack is backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
wrote:

> I wasn’t familiar with Gizzard either so I thought I’d take a look. The
> first things on their github readme is:
> *NB: This project is currently not recommended as a base for new
> consumers.*
> (And no commits since 2013)
>
> So, Cassandra definitely looks like a better choice as your datastore for
> a new project.
>
> Cheers
> Ben
>
> On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
> wrote:
>
>> I am not that familiar with gizzard but with gizzard + mysql, you have
>> multiple moving parts in the system that need to be managed separately. You'll
>> need the mysql expert for mysql and the gizzard expert to manage the
>> distributed part. It can be argued that long term this will have higher
>> administration cost.
>>
>> Cassandra's value add is its simple peer to peer architecture that is
>> easy to manage - a single database solution that is distributed, scalable,
>> highly available etc. In other words, once you gain expertise cassandra,
>> you get everything in one package.
>>
>> regards
>>
>>
>>
>>
>>
>> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
>> wrote:
>>
>> Hi,
>>
>> I'm exploring Cassandra for handling large data sets for mobile app, but
>> i'm not clear where it stands.
>>
>>
>> If we use MySQL as  underlying database and Gizzard for building custom
>> distributed databases (with arbitrary storage technology) and Memcached for
>> highly queried data, then where lies Cassandra?
>>
>>
>> As i have read that Twitter uses both Cassandra and Gizzard. Please
>> explain me where Cassandra will act.
>>
>>
>> Thanks in advance.
>>
>>
>> Regards,
>>
>> Sikander
>>
>>
>> Sent from Outlook 
>>
>>
>>
>>
>> --
>> http://khangaonkar.blogspot.com/
>>
>


Re: Query

2016-12-29 Thread Ben Slater
I wasn’t familiar with Gizzard either so I thought I’d take a look. The
first thing on their github readme is:
*NB: This project is currently not recommended as a base for new consumers.*
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a
new project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
wrote:

> I am not that familiar with gizzard but with gizzard + mysql, you have
> multiple moving parts in the system that need to be managed separately. You'll
> need the mysql expert for mysql and the gizzard expert to manage the
> distributed part. It can be argued that long term this will have higher
> administration cost.
>
> Cassandra's value add is its simple peer to peer architecture that is easy
> to manage - a single database solution that is distributed, scalable,
> highly available etc. In other words, once you gain expertise cassandra,
> you get everything in one package.
>
> regards
>
>
>
>
>
> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
> wrote:
>
> Hi,
>
> I'm exploring Cassandra for handling large data sets for mobile app, but
> i'm not clear where it stands.
>
>
> If we use MySQL as the underlying database, Gizzard for building custom
> distributed databases (with arbitrary storage technology), and Memcached for
> highly queried data, then where does Cassandra fit?
>
>
> I have read that Twitter uses both Cassandra and Gizzard. Please
> explain to me what role Cassandra would play.
>
>
> Thanks in advance.
>
>
> Regards,
>
> Sikander
>
>
> Sent from Outlook 
>
>
>
>
> --
> http://khangaonkar.blogspot.com/
>


Re: Query

2016-12-29 Thread Manoj Khangaonkar
I am not that familiar with Gizzard, but with Gizzard + MySQL you have
multiple moving parts in the system that need to be managed separately. You'll
need the MySQL expert for MySQL and the Gizzard expert to manage the
distributed part. It can be argued that long term this will have a higher
administration cost.

Cassandra's value add is its simple peer-to-peer architecture that is easy
to manage - a single database solution that is distributed, scalable,
highly available etc. In other words, once you gain expertise in Cassandra,
you get everything in one package.

regards





On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
wrote:

> Hi,
>
> I'm exploring Cassandra for handling large data sets for a mobile app, but
> I'm not clear where it stands.
>
>
> If we use MySQL as the underlying database, Gizzard for building custom
> distributed databases (with arbitrary storage technology), and Memcached for
> highly queried data, then where does Cassandra fit?
>
>
> I have read that Twitter uses both Cassandra and Gizzard. Please
> explain to me what role Cassandra would play.
>
>
> Thanks in advance.
>
>
> Regards,
>
> Sikander
>
>
> Sent from Outlook 
>



-- 
http://khangaonkar.blogspot.com/


Query

2016-12-29 Thread Sikander Rafiq
Hi,

I'm exploring Cassandra for handling large data sets for a mobile app, but I'm
not clear where it stands.


If we use MySQL as the underlying database, Gizzard for building custom
distributed databases (with arbitrary storage technology), and Memcached for
highly queried data, then where does Cassandra fit?


I have read that Twitter uses both Cassandra and Gizzard. Please explain to me
what role Cassandra would play.


Thanks in advance.


Regards,

Sikander



Sent from Outlook


Re: Comment on query performance

2016-12-29 Thread Ashutosh Dhundhara
Thanks DuyHai once again :-)

On Thu, Dec 29, 2016 at 3:35 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> No full table scan because you specify all the partition key columns in
> your WHERE clause.
>
> On Thu, Dec 29, 2016 at 11:02 AM, Ashutosh Dhundhara <
> ashutoshdhundh...@yahoo.com> wrote:
>
>> Thanks DuyHai.
>>
>> One more thing, is it going to be a full table scan across all the nodes
>> in cluster?
>>
>> On Thu, Dec 29, 2016 at 3:30 PM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> In your case, ALLOW FILTERING will require Cassandra to scan linearly on
>>> disk and fetch all the partition data into memory, so the performance
>>> depends on how "large" your partition is. For small partitions it should be
>>> fine.
>>>
>>>
>>> On Thu, Dec 29, 2016 at 10:00 AM, Ashutosh Dhundhara <
>>> ashutoshdhundh...@yahoo.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have a table like this:
>>>>
>>>> CREATE TABLE IF NOT EXISTS Posts (
>>>> idObject int,
>>>> objectType text,
>>>> idParent int,
>>>> id int,
>>>> idResolution int,
>>>> PRIMARY KEY ((idObject, objectType, idParent), id)
>>>> );
>>>>
>>>> Now have a look at the following query:
>>>>
>>>> SELECT * FROM POSTS WHERE idobject = 1 AND objectType = 'COURSE' AND 
>>>> idParent = 0 AND idResolution = 1 ALLOW FILTERING
>>>>
>>>> Now the Partition Key is completely known, so if I use ALLOW FILTERING is
>>>> there going to be any performance issue because the filtering is going to
>>>> be done in a known single partition?
>>>>
>>>>
>>>> --
>>>> Ashutosh Dhundhara
>>>>
>>>
>>>
>>
>>
>> --
>> Ashutosh Dhundhara
>>
>
>


-- 
Ashutosh Dhundhara


Re: Comment on query performance

2016-12-29 Thread DuyHai Doan
No full table scan because you specify all the partition key columns in
your WHERE clause.

On Thu, Dec 29, 2016 at 11:02 AM, Ashutosh Dhundhara <
ashutoshdhundh...@yahoo.com> wrote:

> Thanks DuyHai.
>
> One more thing, is it going to be a full table scan across all the nodes
> in cluster?
>
> On Thu, Dec 29, 2016 at 3:30 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> In your case, ALLOW FILTERING will require Cassandra to scan linearly on
>> disk and fetch all the partition data into memory, so the performance
>> depends on how "large" your partition is. For small partitions it should be
>> fine.
>>
>>
>> On Thu, Dec 29, 2016 at 10:00 AM, Ashutosh Dhundhara <
>> ashutoshdhundh...@yahoo.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a table like this:
>>>
>>> CREATE TABLE IF NOT EXISTS Posts (
>>> idObject int,
>>> objectType text,
>>> idParent int,
>>> id int,
>>> idResolution int,
>>> PRIMARY KEY ((idObject, objectType, idParent), id)
>>> );
>>>
>>> Now have a look at the following query:
>>>
>>> SELECT * FROM POSTS WHERE idobject = 1 AND objectType = 'COURSE' AND 
>>> idParent = 0 AND idResolution = 1 ALLOW FILTERING
>>>
>>> Now the Partition Key is completely known, so if I use ALLOW FILTERING is
>>> there going to be any performance issue because the filtering is going to
>>> be done in a known single partition?
>>>
>>>
>>> --
>>> Ashutosh Dhundhara
>>>
>>
>>
>
>
> --
> Ashutosh Dhundhara
>


Re: Comment on query performance

2016-12-29 Thread Ashutosh Dhundhara
Thanks DuyHai.

One more thing, is it going to be a full table scan across all the nodes in
cluster?

On Thu, Dec 29, 2016 at 3:30 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> In your case, ALLOW FILTERING will require Cassandra to scan linearly on
> disk and fetch all the partition data into memory, so the performance
> depends on how "large" your partition is. For small partitions it should be
> fine.
>
>
> On Thu, Dec 29, 2016 at 10:00 AM, Ashutosh Dhundhara <
> ashutoshdhundh...@yahoo.com> wrote:
>
>> Hi All,
>>
>> I have a table like this:
>>
>> CREATE TABLE IF NOT EXISTS Posts (
>> idObject int,
>> objectType text,
>> idParent int,
>> id int,
>> idResolution int,
>> PRIMARY KEY ((idObject, objectType, idParent), id)
>> );
>>
>> Now have a look at the following query:
>>
>> SELECT * FROM POSTS WHERE idobject = 1 AND objectType = 'COURSE' AND 
>> idParent = 0 AND idResolution = 1 ALLOW FILTERING
>>
>> Now the Partition Key is completely known, so if I use ALLOW FILTERING is
>> there going to be any performance issue because the filtering is going to
>> be done in a known single partition?
>>
>>
>> --
>> Ashutosh Dhundhara
>>
>
>


-- 
Ashutosh Dhundhara


Re: Comment on query performance

2016-12-29 Thread DuyHai Doan
In your case, ALLOW FILTERING will require Cassandra to scan linearly on
disk and fetch all the partition data into memory, so the performance
depends on how "large" your partition is. For small partitions it should be
fine.
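
For illustration, a minimal sketch of one possible alternative (reusing the
Posts schema from the question quoted below): if idResolution is made a
clustering column, the same lookup needs no ALLOW FILTERING at all.

CREATE TABLE IF NOT EXISTS Posts (
    idObject int,
    objectType text,
    idParent int,
    idResolution int,
    id int,
    -- idResolution placed first in the clustering order so it can be
    -- restricted with an equality predicate in the WHERE clause
    PRIMARY KEY ((idObject, objectType, idParent), idResolution, id)
);

SELECT * FROM Posts
WHERE idObject = 1 AND objectType = 'COURSE' AND idParent = 0
AND idResolution = 1;  -- no ALLOW FILTERING needed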


On Thu, Dec 29, 2016 at 10:00 AM, Ashutosh Dhundhara <
ashutoshdhundh...@yahoo.com> wrote:

> Hi All,
>
> I have a table like this:
>
> CREATE TABLE IF NOT EXISTS Posts (
> idObject int,
> objectType text,
> idParent int,
> id int,
> idResolution int,
> PRIMARY KEY ((idObject, objectType, idParent), id)
> );
>
> Now have a look at the following query:
>
> SELECT * FROM POSTS WHERE idobject = 1 AND objectType = 'COURSE' AND idParent 
> = 0 AND idResolution = 1 ALLOW FILTERING
>
> Now the Partition Key is completely known, so if I use ALLOW FILTERING is
> there going to be any performance issue because the filtering is going to
> be done in a known single partition?
>
>
> --
> Ashutosh Dhundhara
>


Comment on query performance

2016-12-29 Thread Ashutosh Dhundhara
Hi All,

I have a table like this:

CREATE TABLE IF NOT EXISTS Posts (
idObject int,
objectType text,
idParent int,
id int,
idResolution int,
PRIMARY KEY ((idObject, objectType, idParent), id)
);

Now have a look at the following query:

SELECT * FROM POSTS WHERE idobject = 1 AND objectType = 'COURSE' AND
idParent = 0 AND idResolution = 1 ALLOW FILTERING

Now the Partition Key is completely known, so if I use ALLOW FILTERING is
there going to be any performance issue because the filtering is going to
be done in a known single partition?


-- 
Ashutosh Dhundhara


Re: Query on Cassandra clusters

2016-12-21 Thread Sumit Anvekar
Thank you Alain for the detailed explanation.

To answer your question on Java version, JVM settings and memory usage: we
are using 1.8.0_45, precisely:
>java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

JVM settings are identical on all nodes (cassandra-env.sh is identical).

Further, when I say high memory usage: Cassandra is using heap
(-Xmx3767M) and off-heap of about 6GB out of the total system memory of
14.7 GB. Along with this there are other processes running on this system,
which brings the overall memory usage to >95%. This brings me to another
point: is *heap memory* + *off heap (sum of the values of Space used
(total) from nodetool cfstats)* the total memory used by Cassandra on a
node?

Also, on the disk front, what is a good amount of empty space to leave
unused on the disk partition (~50%?), considering we use the
SizeTieredCompactionStrategy?

On Wed, Dec 21, 2016 at 6:30 PM, Alain RODRIGUEZ  wrote:

> Hi Sumit,
>
>> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
>> version 3.0.3 while the 5 newer nodes have version 3.6.0.
>
>
> I strongly recommend to:
>
>
>- Stick with one version of Apache Cassandra per cluster.
>- Always be as close as possible from the last minor release of the
>Cassandra version in use.
>
>
> So you *really should* not be using 3.0.3 *AND* 3.6.0 but rather 3.0.10
> *OR* 3.7 (currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock
> release cycle where odd numbers are bug fixes only and even numbers introduce
> new features as well.
>
> Running multiple versions for a long period can induce errors; Cassandra
> is built to handle multiple versions only to give operators the time to
> run a rolling restart.
> should happen during this period. Also, I have seen in the past some cases
> where changing the schema was also an issue with multiple versions leading
> to schema disagreements.
>
>> Due to this scenario, a couple of boxes are running very high on memory (95%
>> usage) whereas some of the older-version nodes have just 60-70% memory
>> usage.
>
>
> Hard to say if this is related to the multiple versions of Cassandra but it
> could. Are you sure nodes are using the same JVM / GC options
> (cassandra-env.sh) and Java version?
>
> Also, what exactly is "high on memory (95%)"? Are we talking about heap or
> native memory? Isn't the memory used as page cache (which would still be
> available to the system)?
>
>> 2. To counter #1, I am planning to upgrade the system configuration of the
>> nodes with higher memory usage. But the question is: will it be a problem to
>> have a Cassandra cluster where a couple of nodes have double the system
>> configuration of the other nodes?
>>
>
> It is not a problem per se to have distinct configurations on distinct
> nodes. Cassandra does it very well, and it is frequently used to test some
> configuration change on a canary node, to prevent it from impacting the
> whole service.
>
> Yet, all the nodes should be doing the same work (unless you have some
> heterogeneous hardware and are using a distinct number of vnodes on each
> node). Keeping things homogeneous allows the operator to easily compare how
> nodes are doing, and it makes reasoning about Cassandra, as well as
> troubleshooting issues, way easier.
>
> So I would:
>
> - Fully upgrade / downgrade asap to a chosen version (3.X is known to be
> not yet stable, but going back to 3.0.X might be more painful)
> - Make sure nodes are well balanced and using the same number of ranges
> 'nodetool status '
> - Make sure the node are using the same Java version and JVM settings.
>
> Hope that helps,
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-12-21 8:22 GMT+01:00 Sumit Anvekar :
>
>> I have a couple questions.
>>
>> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
>> version 3.0.3 while the 5 newer nodes have version 3.6.0. It had been running
>> fine until recently, when I started seeing a higher amount of data residing
>> on the newer boxes. The configuration file (YAML file) is exactly the same on
>> all nodes (except for the node host names). I am wondering if the version has
>> something to do with this scenario. Due to this scenario, a couple of boxes
>> are running very high on memory (95% usage) whereas some of the older-version
>> nodes have just 60-70% memory usage.
>>
>> 2. To counter #1, I am planning to upgrade the system configuration of the
>> nodes with higher memory usage. But the question is: will it be a problem to
>> have a Cassandra cluster where a couple of nodes have double the system
>> configuration of the other nodes?
>>
>> 

Re: Query on Cassandra clusters

2016-12-21 Thread Alain RODRIGUEZ
Hi Sumit,

> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
> version 3.0.3 while the 5 newer nodes have version 3.6.0.


I strongly recommend to:


   - Stick with one version of Apache Cassandra per cluster.
   - Always be as close as possible from the last minor release of the
   Cassandra version in use.


So you *really should* not be using 3.0.3 *AND* 3.6.0 but rather 3.0.10 *OR*
3.7 (currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock
release cycle where odd numbers are bug fixes only and even numbers introduce
new features as well.

Running multiple versions for a long period can induce errors; Cassandra is
built to handle multiple versions only to give operators the time to run
a rolling restart.
should happen during this period. Also, I have seen in the past some cases
where changing the schema was also an issue with multiple versions leading
to schema disagreements.

> Due to this scenario, a couple of boxes are running very high on memory (95%
> usage) whereas some of the older-version nodes have just 60-70% memory
> usage.


Hard to say if this is related to the multiple versions of Cassandra but it
could. Are you sure nodes are using the same JVM / GC options
(cassandra-env.sh) and Java version?

Also, what exactly is "high on memory (95%)"? Are we talking about heap or
native memory? Isn't the memory used as page cache (which would still be
available to the system)?

> 2. To counter #1, I am planning to upgrade the system configuration of the
> nodes with higher memory usage. But the question is: will it be a problem to
> have a Cassandra cluster where a couple of nodes have double the system
> configuration of the other nodes?
>

It is not a problem per se to have distinct configurations on distinct
nodes. Cassandra does it very well, and it is frequently used to test some
configuration change on a canary node, to prevent it from impacting the
whole service.

Yet, all the nodes should be doing the same work (unless you have some
heterogeneous hardware and are using a distinct number of vnodes on each
node). Keeping things homogeneous allows the operator to easily compare how
nodes are doing, and it makes reasoning about Cassandra, as well as
troubleshooting issues, way easier.

So I would:

- Fully upgrade / downgrade asap to a chosen version (3.X is known to be not
yet stable, but going back to 3.0.X might be more painful)
- Make sure nodes are well balanced and using the same number of ranges
'nodetool status '
- Make sure the node are using the same Java version and JVM settings.

Hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-12-21 8:22 GMT+01:00 Sumit Anvekar :

> I have a couple questions.
>
> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
> version 3.0.3 while the 5 newer nodes have version 3.6.0. It had been running
> fine until recently, when I started seeing a higher amount of data residing
> on the newer boxes. The configuration file (YAML file) is exactly the same on
> all nodes (except for the node host names). I am wondering if the version has
> something to do with this scenario. Due to this scenario, a couple of boxes
> are running very high on memory (95% usage) whereas some of the older-version
> nodes have just 60-70% memory usage.
>
> 2. To counter #1, I am planning to upgrade the system configuration of the
> nodes with higher memory usage. But the question is: will it be a problem to
> have a Cassandra cluster where a couple of nodes have double the system
> configuration of the other nodes?
>
> Appreciate any comment on the same.
>
> Sumit.
>


Query on Cassandra clusters

2016-12-20 Thread Sumit Anvekar
I have a couple questions.

1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
version 3.0.3 while the 5 newer nodes have version 3.6.0. It had been running
fine until recently, when I started seeing a higher amount of data residing
on the newer boxes. The configuration file (YAML file) is exactly the same on
all nodes (except for the node host names). I am wondering if the version has
something to do with this scenario. Due to this scenario, a couple of boxes
are running very high on memory (95% usage) whereas some of the older-version
nodes have just 60-70% memory usage.

2. To counter #1, I am planning to upgrade the system configuration of the
nodes with higher memory usage. But the question is: will it be a problem to
have a Cassandra cluster where a couple of nodes have double the system
configuration of the other nodes?

Appreciate any comment on the same.

Sumit.


Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Jeff Jirsa


On 2016-12-03 08:44 (-0800), Edward Capriolo  wrote: 
> On Sat, Dec 3, 2016 at 11:01 AM, Edward Capriolo 
> wrote:
> 
> >
> >
> >  A new unique timeuuid (at the time where the statement using it is
> > executed).
> >
> > This indicates that each statement has one unique timeuuid. Calling the UDF
> > twice in one statement and getting different results disagrees with the
> > documentation.
> >
> 
> https://issues.apache.org/jira/browse/CASSANDRA-12989
> 

Reasonable change to me. Doc change committed to trunk - it'll make its way to
the site soonish.
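
For anyone who wants to reproduce the behaviour discussed in this thread, a
minimal sketch (the table and column names are made up for the example):

CREATE TABLE demo (
    a timeuuid,
    b timeuuid,
    PRIMARY KEY (a)
);

-- each now() call yields a fresh timeuuid, so a and b are distinct values
-- and the time components extracted from them may differ
INSERT INTO demo (a, b) VALUES (now(), now());

SELECT toTimestamp(a), toTimestamp(b) FROM demo;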



Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Edward Capriolo
On Sat, Dec 3, 2016 at 11:01 AM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
>
> On Saturday, December 3, 2016, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>>
>>
>> On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> That isn't what the original thread is about. The thread is about the
>>> timestamp portion of the UUID being different.
>>>
>>> Having UUID() return the same thing for all rows in a batch would be the
>>> unexpected thing virtually every time.
>>> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>>> This isn't about using the same UUID though. It's about the timestamp
>>>>> bits in the UUID.
>>>>>
>>>>> What is the use case for generating multiple UUIDs in a single row?
>>>>> Why do you need to extract the timestamp out of both?
>>>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>
>>>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure you saw my reply on thread but I believe everyone's
>>>>>>>> needs can be met; I will copy that here:
>>>>>>>>
>>>>>>>
>>>>>>> I saw it, but the real problem that was raised initially was not
>>>>>>> that of UDF and of allowing both behavior. It's a matter of people being
>>>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>>>> should be changed.
>>>>>>>
>>>>>>> The Hive idea is interesting I guess, and we can switch to
>>>>>>> discussing that, but it's a different problem really and I'm not fond of
>>>>>>> derailing threads. I will just note though that if we're not talking about
>>>>>>> a confusion issue but rather how to get a timeuuid to be fixed within a
>>>>>>> statement, then there is a much more trivial solution: generate it
>>>>>>> client side. The `now()` function is a small convenience but there is
>>>>>>> nothing you cannot do without it client side, and that actually basically
>>>>>>> stands for almost any use of (non aggregate) function in Cassandra
>>>>>>> currently.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>>>>>> @UDFType(deterministic = false)
>>>>>>>>
>>>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>>>
>>>>>>>> The effect is the query planner can see when such a UDF is in use
>>>>>>>> and determine the value once at the start of a very long query."
>>>>>>>>
>>>>>>>> Essentially Hive had a similar if not identical problem: during a
>>>>>>>> long-running distributed process like map/reduce, some users wanted the
>>>>>>>> semantics of:
>>>>>>>>
>>>>>>>> 1) Each call should have a new timestamp
>>>>>>>>
>>>>>>>> While other users wanted the semantics of:
>>>>>>>>
>>>>>>>> 2) Each call should generate the same timestamp
>>>>>>>>
>>>>>>>> The solution implemented was to add an annotation to udf such that
>>>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>>>
>>>>>>>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986

Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Edward Capriolo
On Saturday, December 3, 2016, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
>
> On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> That isn't what the original thread is about. The thread is about the
>> timestamp portion of the UUID being different.
>>
>> Having UUID() return the same thing for all rows in a batch would be the
>> unexpected thing virtually every time.
>> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> This isn't about using the same UUID though. It's about the timestamp
>>>> bits in the UUID.
>>>>
>>>> What is the use case for generating multiple UUIDs in a single row? Why
>>>> do you need to extract the timestamp out of both?
>>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> I am not sure you saw my reply on thread but I believe everyone's
>>>>>>> needs can be met; I will copy that here:
>>>>>>>
>>>>>>
>>>>>> I saw it, but the real problem that was raised initially was not that
>>>>>> of UDF and of allowing both behavior. It's a matter of people being
>>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>>> should be changed.
>>>>>>
>>>>>> The Hive idea is interesting I guess, and we can switch to discussing
>>>>>> that, but it's a different problem really and I'm not fond of derailing
>>>>>> threads. I will just note though that if we're not talking about a
>>>>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>>>>> statement, then there is a much more trivial solution: generate it
>>>>>> client side. The `now()` function is a small convenience but there is
>>>>>> nothing you cannot do without it client side, and that actually basically
>>>>>> stands for almost any use of (non aggregate) function in Cassandra
>>>>>> currently.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>>>>> @UDFType(deterministic = false)
>>>>>>>
>>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>>
>>>>>>> The effect is the query planner can see when such a UDF is in use
>>>>>>> and determine the value once at the start of a very long query."
>>>>>>>
>>>>>>> Essentially Hive had a similar if not identical problem: during a
>>>>>>> long-running distributed process like map/reduce, some users wanted the
>>>>>>> semantics of:
>>>>>>>
>>>>>>> 1) Each call should have a new timestamp
>>>>>>>
>>>>>>> While other users wanted the semantics of:
>>>>>>>
>>>>>>> 2) Each call should generate the same timestamp
>>>>>>>
>>>>>>> The solution implemented was to add an annotation to udf such that
>>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>>
>>>>>>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>>>>>>>
>>>>>>> As a result you can essentially implement two UDFs
>>>>>>>
>>>>>>> @UDFType(deterministic = false)
>>>>>>> public class UDFNow
>>>>>>>
>>>>>>> and for the other people
>>>>>>>
>>>>>>> @UDFType(deterministic = true)
>>>>>>> public class UDFNowOnce extends UDFNow

Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Edward Capriolo
On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:

> That isn't what the original thread is about. The thread is about the
> timestamp portion of the UUID being different.
>
> Having UUID() return the same thing for all rows in a batch would be the
> unexpected thing virtually every time.
> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>>
>>
>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> This isn't about using the same UUID though. It's about the timestamp
>>> bits in the UUID.
>>>
>>> What is the use case for generating multiple UUIDs in a single row? Why
>>> do you need to extract the timestamp out of both?
>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>
>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I am not sure you saw my reply on thread but I believe everyone's
>>>>>> needs can be met; I will copy that here:
>>>>>>
>>>>>
>>>>> I saw it, but the real problem that was raised initially was not that
>>>>> of UDF and of allowing both behavior. It's a matter of people being
>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>> should be changed.
>>>>>
>>>>> The Hive idea is interesting I guess, and we can switch to discussing
>>>>> that, but it's a different problem really and I'm not fond of derailing
>>>>> threads. I will just note though that if we're not talking about a
>>>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>>>> statement, then there is a much more trivial solution: generate it
>>>>> client side. The `now()` function is a small convenience but there is
>>>>> nothing you cannot do without it client side, and that actually basically
>>>>> stands for almost any use of (non aggregate) function in Cassandra
>>>>> currently.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> "Food for thought: Hive's UDFs introduced an annotation  
>>>>>> @UDFType(deterministic
>>>>>> = false)
>>>>>>
>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>
>>>>>> The effect is the query planner can see when such a UDF is in use and
>>>>>> determine the value once at the start of a very long query."
>>>>>>
>>>>>> Essentially Hive had a similar if not identical problem: during a
>>>>>> long-running distributed process like map/reduce, some users wanted the
>>>>>> semantics of:
>>>>>>
>>>>>> 1) Each call should have a new timestamp
>>>>>>
>>>>>> While other users wanted the semantics of:
>>>>>>
>>>>>> 2) Each call should generate the same timestamp
>>>>>>
>>>>>> The solution implemented was to add an annotation to udf such that
>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>
>>>>>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>>>>>>
>>>>>> As a result you can essentially implement two UDFs
>>>>>>
>>>>>> @UDFType(deterministic = false)
>>>>>> public class UDFNow
>>>>>>
>>>>>> and for the other people
>>>>>>
>>>>>> @UDFType(deterministic = true)
>>>>>> public class UDFNowOnce extends UDFNow
>>>>>>
>>>>>> Both use cases are met in a sensible way.
>>>>>>
>>>>>
>>>>>
>>>> The `now()` function is a small convenience but there is nothing you
>>>> cannot do without it client side, and that actually basically stands for
>>>> almost any use of (non aggregate) function in Cassandra currently.

Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Jonathan Haddad
That isn't what the original thread is about. The thread is about the
timestamp portion of the UUID being different.

Having UUID() return the same thing for all rows in a batch would be the
unexpected thing virtually every time.
On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
>
> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
> This isn't about using the same UUID though. It's about the timestamp bits
> in the UUID.
>
> What is the use case for generating multiple UUIDs in a single row? Why do
> you need to extract the timestamp out of both?
> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com>
> wrote:
>
> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
> I am not sure you saw my reply on thread but I believe everyone's needs
> can be met; I will copy that here:
>
>
> I saw it, but the real problem that was raised initially was not that of
> UDF and of allowing both behavior. It's a matter of people being confused
> by the behavior of a non-UDF function, now(), and suggesting it should be
> changed.
>
> The Hive idea is interesting I guess, and we can switch to discussing
> that, but it's a different problem really and I'm not fond of derailing
> threads. I will just note though that if we're not talking about a
> confusion issue but rather how to get a timeuuid to be fixed within a
> statement, then there is a much more trivial solution: generate it
> client side. The `now()` function is a small convenience but there is
> nothing you cannot do without it client side, and that actually basically
> stands for almost any use of (non aggregate) function in Cassandra
> currently.
>
>
>
>
> "Food for thought: Hive's UDFs introduced an annotation  
> @UDFType(deterministic
> = false)
>
>
> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>
> The effect is the query planner can see when such a UDF is in use and
> determine the value once at the start of a very long query."
>
> Essentially Hive had a similar if not identical problem: during a
> long-running distributed process like map/reduce, some users wanted the
> semantics of:
>
> 1) Each call should have a new timestamp
>
> While other users wanted the semantics of:
>
> 2) Each call should generate the same timestamp
>
> The solution implemented was to add an annotation to udf such that the
> query planner would pick up the annotation and act accordingly.
>
> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>
> As a result you can essentially implement two UDFs
>
> @UDFType(deterministic = false)
> public class UDFNow
>
> and for the other people
>
> @UDFType(deterministic = true)
> public class UDFNowOnce extends UDFNow
>
> Both use cases are met in a sensible way.
>
>
>
> The `now()` function is a small convenience but there is nothing you
> cannot do without it client side, and that actually basically stands for
> almost any use of (non aggregate) function in Cassandra currently.
>
> Cassandra's changing philosophy over which entity should create such
> information client/server/driver does not make this problem easy.
>
> If you take into account that you have users who do not understand all the
> intricacies of uuid, the problem is compounded. I.e. how does one generate a
> UUID in each of C#, Python, Java, etc., with the 47 random bits and so on?
> That is not super easy information to find. Maybe you find a Stack Overflow
> post that actually gives bad advice etc.
>
> Many times in Cassandra you are using a uuid because you do not have a
> unique key in the insert and you wish to create one. If you are inserting
> more than a single record using that same UUID and you do not want the
> burden of doing it yourself, you would have to do write>>read>>write,
> which is an anti-pattern.
>
>
> Not multiple ids for a single row. The same id for multiple inserts in a
> batch.
>
> For example, let's say I have an application where my data has no unique
> key.
>
> Table poke
> Poker, pokee, time
>
> Suppose I consume pokes from Kafka, build a batch of 30k, and insert them.
> You probably want to denormalize into two tables:
> Primary key (poker, time)
> Primary key (pokee,time)
>
> It makes sense that they all have the same uuid if you want it to be the
> uuid of the batch. This would make it easy to correlate all the events.
> Easy to delete them all as well.
>
> The "do it client side" argument is totally valid, but it has been a
> justification for not adding features, many of which are eventually added
> anyway.
>
>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Re: Why does `now()` produce different times within the same query?

2016-12-03 Thread Edward Capriolo
On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:

> This isn't about using the same UUID though. It's about the timestamp bits
> in the UUID.
>
> What is the use case for generating multiple UUIDs in a single row? Why do
> you need to extract the timestamp out of both?
> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>>
>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>
>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>>
>>>> I am not sure you saw my reply on thread but I believe everyone's needs
>>>> can be met; I will copy that here:
>>>>
>>>
>>> I saw it, but the real problem that was raised initially was not that of
>>> UDF and of allowing both behavior. It's a matter of people being confused
>>> by the behavior of a non-UDF function, now(), and suggesting it should be
>>> changed.
>>>
>>> The Hive idea is interesting I guess, and we can switch to discussing
>>> that, but it's a different problem really and I'm not fond of derailing
>>> threads. I will just note though that if we're not talking about a
>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>> statement, then there is a much more trivial solution: generate it
>>> client side. The `now()` function is a small convenience but there is
>>> nothing you cannot do without it client side, and that actually basically
>>> stands for almost any use of (non aggregate) function in Cassandra
>>> currently.
>>>
>>>
>>>>
>>>>
>>>> "Food for thought: Hive's UDFs introduced an annotation  
>>>> @UDFType(deterministic
>>>> = false)
>>>>
>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>
>>>> The effect is the query planner can see when such a UDF is in use and
>>>> determine the value once at the start of a very long query."
>>>>
>>>> Essentially Hive had a similar if not identical problem: during a
>>>> long-running distributed process like map/reduce, some users wanted the
>>>> semantics of:
>>>>
>>>> 1) Each call should have a new timestamp
>>>>
>>>> While other users wanted the semantics of:
>>>>
>>>> 2) Each call should generate the same timestamp
>>>>
>>>> The solution implemented was to add an annotation to udf such that the
>>>> query planner would pick up the annotation and act accordingly.
>>>>
>>>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>>>>
>>>> As a result you can essentially implement two UDFs
>>>>
>>>> @UDFType(deterministic = false)
>>>> public class UDFNow
>>>>
>>>> and for the other people
>>>>
>>>> @UDFType(deterministic = true)
>>>> public class UDFNowOnce extends UDFNow
>>>>
>>>> Both use cases are met in a sensible way.
>>>>
>>>
>>>
>> The `now()` function is a small convenience but there is nothing you
>> cannot do without it client side, and that actually basically stands for
>> almost any use of (non aggregate) function in Cassandra currently.
>>
>> Cassandra's changing philosophy over which entity should create such
>> information client/server/driver does not make this problem easy.
>>
>> If you take into account that you have users who do not understand all
>> the intricacies of uuid, the problem is compounded. I.e. how does one
>> generate a UUID in each of C#, Python, Java, etc., with the 47 random bits
>> and so on? That is not super easy information to find. Maybe you find a
>> Stack Overflow post that actually gives bad advice etc.
>>
>> Many times in Cassandra you are using a uuid because you do not have a
>> unique key in the insert and you wish to create one. If you are inserting
>> more than a single record using that same UUID and you do not want the
>> burden of doing it yourself, you would have to do write>>read>>write,
>> which is an anti-pattern.
>>
>
Not multiple ids for a single row. The same id for multiple inserts in a
batch.

For example, let's say I have an application where my data has no unique key.

Table poke
Poker, pokee, time

Suppose I consume pokes from Kafka, build a batch of 30k, and insert them.
You probably want to denormalize into two tables:
Primary key (poker, time)
Primary key (pokee,time)

It makes sense that they all have the same uuid if you want it to be the
uuid of the batch. This would make it easy to correlate all the events.
Easy to delete them all as well.
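
A rough sketch of that pattern (the schema and the timeuuid literal are
illustrative; the value would be generated once client side, e.g. with the
Java driver's UUIDs.timeBased(), and reused across the batch):

CREATE TABLE pokes_by_poker (
    poker text,
    time timeuuid,
    pokee text,
    PRIMARY KEY (poker, time)
);

CREATE TABLE pokes_by_pokee (
    pokee text,
    time timeuuid,
    poker text,
    PRIMARY KEY (pokee, time)
);

-- one timeuuid shared by both denormalized inserts, so the rows
-- correlate on (and can later be deleted by) the same value
BEGIN BATCH
    INSERT INTO pokes_by_poker (poker, time, pokee)
    VALUES ('alice', 4d6e7f50-b9b3-11e6-9aef-0242ac110002, 'bob');
    INSERT INTO pokes_by_pokee (pokee, time, poker)
    VALUES ('bob', 4d6e7f50-b9b3-11e6-9aef-0242ac110002, 'alice');
APPLY BATCH;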

The "do it client side" argument is totally valid, but it has been a
justification for not adding features, many of which are eventually added
anyway.




-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Why does `now()` produce different times within the same query?

2016-12-02 Thread Jonathan Haddad
This isn't about using the same UUID though. It's about the timestamp bits
in the UUID.

What is the use case for generating multiple UUIDs in a single row? Why do
you need to extract the timestamp out of both?
On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com>
> wrote:
>
> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
> I am not sure you saw my reply on thread but I believe everyone's needs
> can be met; I will copy that here:
>
>
> I saw it, but the real problem that was raised initially was not that of
> UDF and of allowing both behavior. It's a matter of people being confused
> by the behavior of a non-UDF function, now(), and suggesting it should be
> changed.
>
> The Hive idea is interesting I guess, and we can switch to discussing
> that, but it's a different problem really and I'm not fond of derailing
> threads. I will just note though that if we're not talking about a
> confusion issue but rather how to get a timeuuid to be fixed within a
> statement, then there is a much more trivial solution: generate it
> client side. The `now()` function is a small convenience but there is
> nothing you cannot do without it client side, and that actually basically
> stands for almost any use of (non aggregate) function in Cassandra
> currently.
>
>
>
>
> "Food for thought: Hive's UDFs introduced an annotation  
> @UDFType(deterministic
> = false)
>
>
> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>
> The effect is the query planner can see when such a UDF is in use and
> determine the value once at the start of a very long query."
>
> Essentially Hive had a similar if not identical problem: during a
> long-running distributed process like map/reduce, some users wanted the
> semantics of:
>
> 1) Each call should have a new timestamp
>
> While other users wanted the semantics of:
>
> 2) Each call should generate the same timestamp
>
> The solution implemented was to add an annotation to udf such that the
> query planner would pick up the annotation and act accordingly.
>
> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>
> As a result you can essentially implement two UDFs
>
> @UDFType(deterministic = false)
> public class UDFNow
>
> and for the other people
>
> @UDFType(deterministic = true)
> public class UDFNowOnce extends UDFNow
>
> Both use cases are met in a sensible way.
>
>
>
> The `now()` function is a small convenience but there is nothing you
> cannot do without it client side, and that actually basically stands for
> almost any use of (non aggregate) function in Cassandra currently.
>
> Cassandra's changing philosophy over which entity should create such
> information client/server/driver does not make this problem easy.
>
> If you take into account that you have users who do not understand all the
> intricacies of uuid, the problem is compounded. I.e. how does one generate a
> UUID in each of C#, Python, Java, etc., with the 47 random bits and so on?
> That is not super easy information to find. Maybe you find a Stack Overflow
> post that actually gives bad advice etc.
>
> Many times in Cassandra you are using a uuid because you do not have a
> unique key in the insert and you wish to create one. If you are inserting
> more than a single record using that same UUID and you do not want the
> burden of doing it yourself, you would have to do write>>read>>write,
> which is an anti-pattern.
>


Re: Why does `now()` produce different times within the same query?

2016-12-02 Thread Edward Capriolo
On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com>
wrote:

> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>>
>> I am not sure you saw my reply on thread but I believe everyone's needs
>> can be met; I will copy that here:
>>
>
> I saw it, but the real problem that was raised initially was not that of
> UDF and of allowing both behavior. It's a matter of people being confused
> by the behavior of a non-UDF function, now(), and suggesting it should be
> changed.
>
> The Hive idea is interesting I guess, and we can switch to discussing
> that, but it's a different problem really and I'm not fond of derailing
> threads. I will just note though that if we're not talking about a
> confusion issue but rather how to get a timeuuid to be fixed within a
> statement, then there is a much more trivial solution: generate it
> client side. The `now()` function is a small convenience but there is
> nothing you cannot do without it client side, and that actually basically
> stands for almost any use of (non aggregate) function in Cassandra
> currently.
>
>
>>
>>
>> "Food for thought: Hive's UDFs introduced an annotation
>> @UDFType(deterministic = false)
>>
>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>
>> The effect is the query planner can see when such a UDF is in use and
>> determine the value once at the start of a very long query."
>>
>> Essentially Hive had a similar if not identical problem: during a
>> long-running distributed process like map/reduce, some users wanted the
>> semantics of:
>>
>> 1) Each call should have a new timestamp
>>
>> While other users wanted the semantics of:
>>
>> 2) Each call should generate the same timestamp
>>
>> The solution implemented was to add an annotation to udf such that the
>> query planner would pick up the annotation and act accordingly.
>>
>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>>
>> As a result you can essentially implement two UDFs
>>
>> @UDFType(deterministic = false)
>> public class UDFNow
>>
>> and for the other people
>>
>> @UDFType(deterministic = true)
>> public class UDFNowOnce extends UDFNow
>>
>> Both use cases are met in a sensible way.
>>
>
>
The `now()` function is a small convenience but there is nothing you cannot
do without it client side, and that actually basically stands for almost
any use of (non aggregate) function in Cassandra currently.

Cassandra's changing philosophy over which entity should create such
information client/server/driver does not make this problem easy.

If you take into account that you have users who do not understand all the
intricacies of uuid, the problem is compounded. I.e. how does one generate a
UUID in each of C#, Python, Java, etc., with the 47 random bits and so on?
That is not super easy information to find. Maybe you find a Stack Overflow
post that actually gives bad advice etc.

Many times in Cassandra you are using a uuid because you do not have a
unique key in the insert and you wish to create one. If you are inserting
more than a single record using that same UUID and you do not want the
burden of doing it yourself, you would have to do write>>read>>write,
which is an anti-pattern.


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Ben Bromhead
>
>
>
> I will note that Ben seems to suggest keeping the return of now() unique
> across calls while keeping the time component equal, thus varying the rest
> of the uuid bytes. However:
> - I'm starting to wonder what this would buy us. Why would someone be super
> confused by the time changing across calls (in a single statement/batch),
> but be totally not confused by the actual full return to not be equal?
>
Given that a common way of interacting with timeuuids is with toTimestamp, I
can see the confusion and assumptions around behaviour.

> And how is
> that actually useful: you're having different results anyway and you're
> letting the server pick the timestamp in the first place, so you're probably
> not caring about millisecond precision of that timestamp in the first place.
>
If you want consistency of timestamps within your query, as the OP did, I can
see how this is useful. Postgres claims this is a "feature".

> - This would basically be a violation of the timeuuid spec

Not quite... Type 1 uuids let you swap out the low 47 bits of the node
component with other randomly generated bits (
https://www.ietf.org/rfc/rfc4122.txt)

> - This would be a big pain in the code and make now() a special case
> among functions. I'm unconvinced special cases are making things easier
> in general.
>

On reflection, I have to agree here; now() has been around forever and
this is the first anecdote I've seen of someone getting caught out.

However, with my user-advocate hat on, I think it would be worth
investigating further, beyond a documentation update, if others find it a
sticking point in Cassandra adoption.


> So I'm all for improving the documentation if this confuses users due to
> expectations (mistakenly) carried from prior experiences, and please
> feel free to open a JIRA for that. I'm a lot less in agreement that there is
> something wrong with the way the function behaves in principle.
>


> > I can see why this issue has been largely ignored and hasn't had a chance
> > for the behaviour to be formally defined
>
> Don't make too many assumptions. The behavior is perfectly well defined:
> now() is a "normal" function and is evaluated whenever it's called according
> to the timeuuid spec (or as close to it as we can make it).
>
Maybe formally defined is the wrong term... Formally documented?

>
> On Thu, Dec 1, 2016 at 7:25 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> Great comment. +1
>
> On 01.12.2016 at 06:29, "Ben Bromhead" <b...@instaclustr.com> wrote:
>
> tl;dr +1 yup raise a jira to discuss how now() should behave in a single
> statement (and possibly extend to batch statements).
>
> The values of now() should be the same if you assume that now() works like
> it does in relational databases such as Postgres or MySQL; however, at the
> moment it instead works like sysdate() in MySQL. Given that CQL is supposed
> to be SQL-like, I think the assumption around the behaviour of now() was a
> fair one to make.
>
> I definitely agree that raising a JIRA ticket would be a great way to
> discuss what the behaviour of now() should be for Cassandra. Personally I
> would be in favour of seeing the deterministic component (the actual time
> part) being the same across multiple calls in the one statement or multiple
> statements in a batch.
>
> Cassandra documentation does not make any claims as to how now() works
> within a single statement, and reading the code shows the intent is to
> work like sysdate() from MySQL rather than now(). One of the identified
> dangers of making CQL similar to SQL is that, while yes it aids adoption,
> users will find that SQL-like things don't behave as expected. Of course as
> a user, one shouldn't have to read the source code to determine correct
> behaviour.
>
> Given that a timeuuid is made up of deterministic and (pseudo)
> non-deterministic components, I can see why this issue has been largely
> ignored and hasn't had a chance for the behaviour to be formally defined
> (you would expect now() to return the same time in the one statement despite
> multiple calls, but you wouldn't expect the same behaviour for, say, a call
> to rand()).
>
>
>
>
>
>
>
> On Wed, 30 Nov 2016 at 19:54 Cody Yancey <yan...@uber.com> wrote:
>
> This is not a bug, and in fact changing it would be a serious bug.
>
> False. Absolutely no consumer would be broken by a change to guarantee an
> identical time component that isn't broken already, for the simple reason
> your code already has to handle that case, as it is in fact the majority
> case RIGHT NOW. Users can hit this bug, in production, because unit tests
> might 

Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Marko Švaljek
One millisecond is not an issue in most Internet of Things projects out
there. There are lots of connection-related things that add far more
latency to the requests than that. Especially if you take into account the
time it takes for the data to actually reach a Cassandra node in the
background etc. I'm simply not aware of any larger projects where edge
devices write directly to Cassandra.

Requests almost always come in to some sort of gateway before that. The
usual pattern is storing the timestamp measured on the device (if it even
has its own clock) and the timestamp when it was received on the platform
side. Having two identical millisecond-level timestamps in one insert
statement generated by now() simply doesn't add that much to the table.

The only case that comes to my mind would be time-series bucketing of
inserts, placing measurements in partitions based on some sort of mapping
function over the results of now(); but then again this is usually done on
the server side, and I'm not sure it would be best practice to do it within
the insert.
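
A rough sketch of that bucketing pattern (the table and column names are
hypothetical):

-- measurements partitioned by device and day; the day bucket is derived
-- from the measurement timestamp, not from a server-side now()
CREATE TABLE measurements (
    device_id text,
    day text,          -- e.g. '2016-12-01'
    ts timeuuid,
    value double,
    PRIMARY KEY ((device_id, day), ts)
);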

Even if it were done that way, analytics (be it near-real-time or batch)
usually takes that kind of thing into account and compensates - reports
rarely show millisecond-level dynamics.

In the end it just wouldn't be a good idea to change the behaviour of a
function that has been around for quite some time.


@msvaljek 

2016-12-01 18:10 GMT+01:00 Cody Yancey :

> On Thu, Dec 1, 2016 at 11:09 AM Sylvain Lebresne 
> wrote:
>
>> there is a much more trivial solution: generate it client side. The
>> `now()` function is a small convenience but there is nothing you cannot do
>> without it client side
>>
>
> Please see my post above as to why this is a bad idea for inserts based on
> request time where knowing the time the request was made is actually
> important.
>
> Cody
>
>>
>


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Cody Yancey
On Thu, Dec 1, 2016 at 11:09 AM Sylvain Lebresne 
wrote:

> there is a much more trivial solution: generate it client side. The
> `now()` function is a small convenience but there is nothing you cannot do
> without it client side
>

Please see my post above as to why this is a bad idea for inserts based on
request time, where knowing the time the request was made is actually
important.

Cody

>


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Edward Capriolo
On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com>
wrote:

> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>>
>> I am not sure you saw my reply on thread but I believe everyone's needs
>> can be met; I will copy that here:
>>
>
> I saw it, but the real problem that was raised initially was not that of
> UDF and of allowing both behavior. It's a matter of people being confused
> by the behavior of a non-UDF function, now(), and suggesting it should be
> changed.
>
> The Hive idea is interesting I guess, and we can switch to discussing
> that, but it's a different problem really and I'm not fond of derailing
> threads. I will just note though that if we're not talking about a
> confusion issue but rather how to get a timeuuid to be fixed within a
> statement, then there is a much more trivial solution: generate it
> client side. The `now()` function is a small convenience but there is
> nothing you cannot do without it client side, and that actually basically
> stands for almost any use of (non aggregate) function in Cassandra
> currently.
>
>
>>
>>
>> "Food for thought: Hive's UDFs introduced an annotation
>> @UDFType(deterministic = false)
>>
>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>
>> The effect is the query planner can see when such a UDF is in use and
>> determine the value once at the start of a very long query."
>>
>> Essentially Hive had a similar if not identical problem: during a
>> long-running distributed process like map/reduce, some users wanted the
>> semantics of:
>>
>> 1) Each call should have a new timestamp
>>
>> While other users wanted the semantics of:
>>
>> 2) Each call should generate the same timestamp
>>
>> The solution implemented was to add an annotation to udf such that the
>> query planner would pick up the annotation and act accordingly.
>>
>> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>>
>> As a result you can essentially implement two UDFs
>>
>> @UDFType(deterministic = false)
>> public class UDFNow
>>
>> and for the other people
>>
>> @UDFType(deterministic = true)
>> public class UDFNowOnce extends UDFNow
>>
>> Both use cases are met in a sensible way.
>>
>
>
I agree that changing the semantics of something already in existence is a bad
idea. What is there "now" (no pun intended) works and should stay working as is.

I will also point out that Presto addresses this issue with specific
functions:

https://prestodb.io/docs/current/functions/datetime.html

localtime -> time
    Returns the current time as of the start of the query.

localtimestamp -> timestamp
    Returns the current timestamp as of the start of the query.


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Jonathan Haddad
+1 to everything Sylvain said.
On Thu, Dec 1, 2016 at 11:09 AM Sylvain Lebresne <sylv...@datastax.com>
wrote:

> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
> I am not sure you saw my reply on thread but I believe everyone's needs
> can be met; I will copy that here:
>
>
> I saw it, but the real problem that was raised initially was not that of
> UDF and of allowing both behavior. It's a matter of people being confused
> by the behavior of a non-UDF function, now(), and suggesting it should be
> changed.
>
> The Hive idea is interesting I guess, and we can switch to discussing
> that, but it's a different problem really and I'm not fond of derailing
> threads. I will just note though that if we're not talking about a
> confusion issue but rather how to get a timeuuid to be fixed within a
> statement, then there is a much more trivial solution: generate it
> client side. The `now()` function is a small convenience but there is
> nothing you cannot do without it client side, and that holds for almost
> any use of a (non-aggregate) function in Cassandra
> currently.
>
>
>
>
> "Food for thought: Hive's UDFs introduced an annotation  
> @UDFType(deterministic
> = false)
>
>
> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>
> The effect is the query planner can see when such a UDF is in use and
> determine the value once at the start of a very long query."
>
> Essentially Hive had a similar, if not identical, problem: during a long
> running distributed process like map/reduce, some users wanted the semantics
> of:
>
> 1) Each call should have a new timestamp
>
> While other users wanted the semantics of:
>
> 2) Each call should generate the same timestamp
>
> The solution implemented was to add an annotation to UDFs such that the
> query planner would pick up the annotation and act accordingly.
>
> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>
> As a result you can essentially implement two UDFS
>
> @UDFType(deterministic = false)
> public class UDFNow
>
> and for the other people
>
> @UDFType(deterministic = true)
> public class UDFNowOnce extends UDFNow
>
> Both use cases are met in a sensible way.
>
>


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Sylvain Lebresne
On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
> I am not sure you saw my reply on the thread, but I believe everyone's needs
> can be met. I will copy that here:
>

I saw it, but the real problem that was raised initially was not that of
UDFs and of allowing both behaviors. It's a matter of people being confused
by the behavior of a non-UDF function, now(), and suggesting it should be
changed.

The Hive idea is interesting I guess, and we can switch to discussing that,
but it's a different problem really and I'm not fond of derailing
threads. I will just note though that if we're not talking about a
confusion issue but rather how to get a timeuuid to be fixed within a
statement, then there is a much more trivial solution: generate it
client side. The `now()` function is a small convenience but there is
nothing you cannot do without it client side, and that holds for almost
any use of a (non-aggregate) function in Cassandra currently.
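
As a concrete illustration of the client-side route, a minimal sketch using
the DataStax Java driver 3.x and its UUIDs helper (the contact point,
keyspace and table are made up for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class ClientSideTimeuuid {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {
            // Generate the timeuuid once in the application instead of
            // calling now(), so every use of it sees the same value.
            UUID id = UUIDs.timeBased();
            session.execute("INSERT INTO ts (k, t, v) VALUES (0, ?, 'foo')", id);
        }
    }
}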


>
>
> "Food for thought: Hive's UDFs introduced an annotation
> @UDFType(deterministic = false)
>
> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>
> The effect is the query planner can see when such a UDF is in use and
> determine the value once at the start of a very long query."
>
> Essentially Hive had a similar, if not identical, problem: during a long
> running distributed process like map/reduce, some users wanted the semantics
> of:
>
> 1) Each call should have a new timestamp
>
> While other users wanted the semantics of:
>
> 2) Each call should generate the same timestamp
>
> The solution implemented was to add an annotation to UDFs such that the
> query planner would pick up the annotation and act accordingly.
>
> (Here is a related issue https://issues.apache.org/jira/browse/HIVE-1986
>
> As a result you can essentially implement two UDFS
>
> @UDFType(deterministic = false)
> public class UDFNow
>
> and for the other people
>
> @UDFType(deterministic = true)
> public class UDFNowOnce extends UDFNow
>
> Both use cases are met in a sensible way.
>


Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Bruce Heath
Get Outlook for Android<https://aka.ms/ghei36>


From: Edward Capriolo <edlinuxg...@gmail.com>
Sent: Thursday, December 1, 2016 10:44:10 AM
To: user@cassandra.apache.org
Subject: Re: Why does `now()` produce different times within the same query?



On Thu, Dec 1, 2016 at 4:06 AM, Sylvain Lebresne 
<sylv...@datastax.com> wrote:
One can of course always open a JIRA, but I'm going to strongly disagree with a
change here (outside of a documentation one that is).

The now() function is a timeuuid generator, and it thus generates a unique
timeuuid on every call, as specified by the timeuuid spec. I'll note that
document lists it under "Timeuuid functions", and has sentences like
"the value returned by now() is guaranteed to be unique", so while I'm sure the
documentation can be further clarified, I think it's pretty clear it's not the
now() of SQL, and getting unique values on every call shouldn't be *that*
surprising.

Also, now() was primarily meant for use on timeuuid clustering columns for a
time-series like table, something like:
  CREATE TABLE ts (
k int,
t timeuuid,
v text,
PRIMARY KEY (k, t)
  )
and if you use it multiple times in a batch, this would look something like:
  BEGIN BATCH
INSERT INTO ts (k, t, v) VALUES (0, now(), 'foo');
INSERT INTO ts (k, t, v) VALUES (0, now(), 'bar');
  APPLY BATCH
and you definitely want that to insert 2 "events", not just one.

This is also why changing the behavior of this method *would* be a breaking
change.

Another reason this works the way it does is that functions in CQL are just that,
functions. Each execution is unique and they have no notion of being executed in
the same statement/batch/whatever. I actually think this is sensible, assuming
one stops being obsessed with what other databases that aren't Apache Cassandra
do.

I will note that Ben seems to suggest keeping the return of now() unique across
calls while keeping the time component equal, thus varying the rest of the uuid
bytes. However:
 - I'm starting to wonder what this would buy us. Why would someone be super
   confused by the time changing across calls (in a single statement/batch), but
   not at all confused by the full return value not being equal? And how is
   that actually useful: you're getting different results anyway and you're
   letting the server pick the timestamp in the first place, so you probably
   don't care about millisecond precision of that timestamp in the first place.
 - This would basically be a violation of the timeuuid spec
 - This would be a big pain in the code and make now() a special case
among functions. I'm unconvinced special cases make things easier
in general.

So I'm all for improving the documentation if this confuses users due to
expectations (mistakenly) carried from prior experiences, and please
feel free to open a JIRA for that. I'm a lot less in agreement that there is
something wrong with the way the function behaves in principle.

> I can see why this issue has been largely ignored and hasn't had a chance for
> the behaviour to be formally defined

Don't make too many assumptions. The behavior is perfectly well defined: now()
is a "normal" function and is evaluated whenever it's called according to the
timeuuid spec (or as close to it as we can make it).

On Thu, Dec 1, 2016 at 7:25 AM, Benjamin Roth 
<benjamin.r...@jaumo.com> wrote:

Great comment. +1

On 01.12.2016 at 06:29, "Ben Bromhead"
<b...@instaclustr.com> wrote:
tl;dr +1 yup raise a jira to discuss how now() should behave in a single 
statement (and possibly extend to batch statements).

The values of now should be the same if you assume that now() works like it 
does in relational databases such as Postgres or MySQL; at the moment, however, 
it instead works like sysdate() in MySQL. Given that CQL is supposed to be SQL 
like, I think the assumption around the behaviour of now() was a fair one to 
make.

I definitely agree that raising a jira ticket would be a great place to discuss 
what the behaviour of now() should be for Cassandra. Personally I would be in 
favour of seeing the deterministic component (the actual time part) being the 
same across multiple calls in the one statement or multiple statements in a 
batch.

Cassandra documentation does not make any claims as to how now() works within a 
single statement, and reading the code shows the intent is to work like 
sysdate() from MySQL rather than now(). One of the identified dangers of making 
cql similar to sql is that, while yes it aids adoption, users will find that 
SQL like things don't behave as expected. Of course as a user, one shouldn't 
have to read the source code to determine correct behaviour.

Given that a timeuuid is made up of deterministic and (pseudo) non-deterministic 
components I can see why this issue has been largely ignored and hasn't had a 
chance for the behaviour to be formally defined (you would expect now to return 
the same time in the one statement despite multiple calls, but you wouldn't 
expect the same behaviour for say a call to rand()).

Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Edward Capriolo
>>> One of the identified dangers of making cql similar to sql is that, while
>>> yes it aids adoption,
>>> users will find that SQL like things don't behave as expected. Of course as
>>> a user, one shouldn't have to read the source code to determine correct
>>> behaviour.
>>>
>>> Given that a timeuuid is made up of deterministic and (pseudo)
>>> non-deterministic components I can see why this issue has been largely
>>> ignored and hasn't had a chance for the behaviour to be formally defined
>>> (you would expect now to return the same time in the one statement despite
>>> multiple calls, but you wouldn't expect the same behaviour for say a call
>>> to rand()).
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 30 Nov 2016 at 19:54 Cody Yancey <yan...@uber.com> wrote:
>>>
>>>> This is not a bug, and in fact changing it would be a serious bug.
>>>>
>>>> False. Absolutely no consumer would be broken by a change to guarantee
>>>> an identical time component that isn't broken already, for the simple
>>>> reason your code already has to handle that case, as it is in fact the
>>>> majority case RIGHT NOW. Users can hit this bug, in production, because
>>>> unit tests might not have experienced it! The time component should be the time
>>>> that the command was processed by the coordinator node.
>>>>
>>>>  would one expect a java/py/bash script that loops
>>>>
>>>> Individual Cassandra writes (which is what OP is referring to
>>>> specifically) are not loops. They are in almost every case atomic
>>>> operations that either succeed completely or fail completely. Allowing a
>>>> single atomic operation to witness multiple times in these corner cases is
>>>> not only surprising, as this thread demonstrates, it also needlessly
>>>> restricts what developers can use the database for, and provides NO
>>>> BENEFIT.
>>>>
>>>> Calling now PRIOR to initiating multiple inserts is in most cases
>>>> exactly what one does...the ONLY practice is to set the value before
>>>> initiating the sequence of calls
>>>>
>>>> Also false. Cassandra does not have a way of doing this on the
>>>> coordinator node rather than the client device, and as I already showed,
>>>> the client device is the wrong place to do it in situations where
>>>> guaranteeing bounded clock-skew actually makes a difference one way or the
>>>> other.
>>>>
>>>> Thanks,
>>>> Cody
>>>>
>>>>
>>>>
>>>> On Wed, Nov 30, 2016 at 8:02 PM, daemeon reiydelle <daeme...@gmail.com>
>>>> wrote:
>>>>
>>>> This is not a bug, and in fact changing it would be a serious bug.
>>>>
>>>> What it is is a wonderful case of bad coding: would one expect a
>>>> java/py/bash script that loops on a bunch of read/execute/update calls where
>>>> each iteration calls time to return the same exact time for the duration of
>>>> the execution of the code? Whether the code runs for 5 seconds or 5 hours?
>>>>
>>>> Every call to a system call is unique, including within C*. Calling now
>>>> PRIOR to initiating multiple inserts is in most cases exactly what one does
>>>> to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
>>>> identical system time as would be the uuid of the row, one tries to call
>>>> time as close to just before the insert as possible. Then repeat.
>>>>
>>>> You have a logic issue in your code. If you want the same value for a
>>>> set of calls, the ONLY practice is to set the value before initiating the
>>>> sequence of calls.
>>>>
>>>>
>>>>
>>>> *...*
>>>>
>>>>
>>>>
>>>> *Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London
>>>> (+44) (0) 20 8144 9872*
>>>>
>>>> On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:
>>>>
>>>> Getting the same TimeUUID values might be a major problem. Getting two
>>>> different TimeUUIDs that at least have the same time component would not be a major
>>>> problem as this is the main case today. Getting different time components
>>>> is actually the corner case, and it is a corner case that breaks
>>>> Internet-of-Things applications.

Re: Why does `now()` produce different times within the same query?

2016-12-01 Thread Sylvain Lebresne
>> multiple calls, but you wouldn't expect the same behaviour for say a call
>> to rand()).
>>
>>
>>
>>
>>
>>
>>
>> On Wed, 30 Nov 2016 at 19:54 Cody Yancey <yan...@uber.com> wrote:
>>
>>> This is not a bug, and in fact changing it would be a serious bug.
>>>
>>> False. Absolutely no consumer would be broken by a change to guarantee
>>> an identical time component that isn't broken already, for the simple
>>> reason your code already has to handle that case, as it is in fact the
>>> majority case RIGHT NOW. Users can hit this bug, in production, because
>>> unit tests might not have experienced it! The time component should be the time
>>> that the command was processed by the coordinator node.
>>>
>>>  would one expect a java/py/bash script that loops
>>>
>>> Individual Cassandra writes (which is what OP is referring to
>>> specifically) are not loops. They are in almost every case atomic
>>> operations that either succeed completely or fail completely. Allowing a
>>> single atomic operation to witness multiple times in these corner cases is
>>> not only surprising, as this thread demonstrates, it also needlessly
>>> restricts what developers can use the database for, and provides NO
>>> BENEFIT.
>>>
>>> Calling now PRIOR to initiating multiple inserts is in most cases
>>> exactly what one does...the ONLY practice is to set the value before
>>> initiating the sequence of calls
>>>
>>> Also false. Cassandra does not have a way of doing this on the
>>> coordinator node rather than the client device, and as I already showed,
>>> the client device is the wrong place to do it in situations where
>>> guaranteeing bounded clock-skew actually makes a difference one way or the
>>> other.
>>>
>>> Thanks,
>>> Cody
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 8:02 PM, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>> This is not a bug, and in fact changing it would be a serious bug.
>>>
>>> What it is is a wonderful case of bad coding: would one expect a
>>> java/py/bash script that loops on a bunch of read/execute/update calls where
>>> each iteration calls time to return the same exact time for the duration of
>>> the execution of the code? Whether the code runs for 5 seconds or 5 hours?
>>>
>>> Every call to a system call is unique, including within C*. Calling now
>>> PRIOR to initiating multiple inserts is in most cases exactly what one does
>>> to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
>>> identical system time as would be the uuid of the row, one tries to call
>>> time as close to just before the insert as possible. Then repeat.
>>>
>>> You have a logic issue in your code. If you want the same value for a
>>> set of calls, the ONLY practice is to set the value before initiating the
>>> sequence of calls.
>>>
>>>
>>>
>>> *...*
>>>
>>>
>>>
>>> *Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London
>>> (+44) (0) 20 8144 9872*
>>>
>>> On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:
>>>
>>> Getting the same TimeUUID values might be a major problem. Getting two
>>> different TimeUUIDs that at least have the same time component would not be a major
>>> problem as this is the main case today. Getting different time components
>>> is actually the corner case, and it is a corner case that breaks
>>> Internet-of-Things applications. We can tightly control clock skew in our
>>> cluster. We most definitely CANNOT control clock skew on the thousands of
>>> sensors that write to our cluster.
>>>
>>> Thanks,
>>> Cody
>>>
>>> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>>>
>>> In my opinion, this is not broken and “fixing” it would break existing
>>> code. Consider a batch that includes multiple inserts, each of which
>>> inserts the value returned by now(). Getting the same UUID for each insert
>>> would be a major problem.
>>>
>>> Cheers
>>>
>>> Robert
>>>
>>>
>>> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com>
>>> wrote:
>>>
>>> FWIW I'd suggest opening a bug--this behavior is certainly quite unexpected
>>> and more than just a documentation issue.

Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Benjamin Roth
>> London (+44) (0) 20 8144 9872
>>
>> On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:
>>
>> Getting the same TimeUUID values might be a major problem. Getting two
>> different TimeUUIDs that at least have the same time component would not be a major
>> problem as this is the main case today. Getting different time components
>> is actually the corner case, and it is a corner case that breaks
>> Internet-of-Things applications. We can tightly control clock skew in our
>> cluster. We most definitely CANNOT control clock skew on the thousands of
>> sensors that write to our cluster.
>>
>> Thanks,
>> Cody
>>
>> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>>
>> In my opinion, this is not broken and “fixing” it would break existing
>> code. Consider a batch that includes multiple inserts, each of which
>> inserts the value returned by now(). Getting the same UUID for each insert
>> would be a major problem.
>>
>> Cheers
>>
>> Robert
>>
>>
>> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com> wrote:
>>
>> FWIW I'd suggest opening a bug--this behavior is certainly quite
>> unexpected and more than just a documentation issue. In general I can't
>> imagine any desirable properties of the current implementation, and there
>> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>>
>> Todd
>>
>> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>>
>> Sorry for my typo. Obviously, I meant:
>> "It appears that a single query that calls Cassandra's`now()` time
>> function *multiple times *may actually cause a query to write or return
>> different times."
>>
>> Less of a surprise now that I realize more about the implementation, but
>> I agree that more explicit documentation around when exactly the
>> "execution" of each now() statement happens and what implications it has
>> for the resulting timestamps would be helpful when running into this.
>>
>> Thanks for the quick responses!
>>
>> -Terry
>>
>>
>>
>> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com>
>> wrote:
>>
>> every now() call in a statement is under the hood "replaced" with a newly
>> generated uuid.
>>
>> It can happen that they belong to different milliseconds in time.
>>
>> If you need to have the same timestamps you need to set them on the client
>> side.
>>
>>
>> @msvaljek <https://twitter.com/msvaljek>
>>
>> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>>
>> It appears that a single query that calls Cassandra's `now()` time
>> function may actually cause a query to write or return different times.
>>
>> Is this the expected or defined behavior, and if so, why does it behave
>> like this rather than evaluating `now()` once across an entire statement?
>>
>> This really affects UPDATE statements but to test it more easily, you
>> could try something like:
>>
>> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
>> FROM keyspace.table
>> LIMIT 100;
>>
>> If you run that a few times, you should eventually see that the timestamp
>> returned moves onto the next millisecond mid-query.
>>
>> --
>> *Software Engineer*
>> Turnitin - http://www.turnitin.com
>> t...@turnitin.com
>>
>>
>>
>>
>>
>> --
>> *Software Engineer*
>> Turnitin - http://www.turnitin.com
>> t...@turnitin.com
>>
>>
>>
>>
>>
>> --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Ben Bromhead
> We can tightly control clock skew in our
> cluster. We most definitely CANNOT control clock skew on the thousands of
> sensors that write to our cluster.
>
> Thanks,
> Cody
>
> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>
> In my opinion, this is not broken and “fixing” it would break existing
> code. Consider a batch that includes multiple inserts, each of which
> inserts the value returned by now(). Getting the same UUID for each insert
> would be a major problem.
>
> Cheers
>
> Robert
>
>
> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com> wrote:
>
> FWIW I'd suggest opening a bug--this behavior is certainly quite
> unexpected and more than just a documentation issue. In general I can't
> imagine any desirable properties of the current implementation, and there
> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>
> Todd
>
> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>
> Sorry for my typo. Obviously, I meant:
> "It appears that a single query that calls Cassandra's`now()` time
> function *multiple times *may actually cause a query to write or return
> different times."
>
> Less of a surprise now that I realize more about the implementation, but I
> agree that more explicit documentation around when exactly the "execution"
> of each now() statement happens and what implications it has for the
> resulting timestamps would be helpful when running into this.
>
> Thanks for the quick responses!
>
> -Terry
>
>
>
> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com> wrote:
>
> every now() call in a statement is under the hood "replaced" with a newly
> generated uuid.
>
> It can happen that they belong to different milliseconds in time.
>
> If you need to have the same timestamps you need to set them on the client
> side.
>
>
> @msvaljek <https://twitter.com/msvaljek>
>
> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>
> It appears that a single query that calls Cassandra's `now()` time
> function may actually cause a query to write or return different times.
>
> Is this the expected or defined behavior, and if so, why does it behave
> like this rather than evaluating `now()` once across an entire statement?
>
> This really affects UPDATE statements but to test it more easily, you
> could try something like:
>
> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
> FROM keyspace.table
> LIMIT 100;
>
> If you run that a few times, you should eventually see that the timestamp
> returned moves onto the next millisecond mid-query.
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>
>
>
>
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>
>
>
>
>
> --
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Edward Capriolo
On Wed, Nov 30, 2016 at 10:53 PM, Cody Yancey <yan...@uber.com> wrote:

> This is not a bug, and in fact changing it would be a serious bug.
>
> False. Absolutely no consumer would be broken by a change to guarantee an
> identical time component that isn't broken already, for the simple reason
> your code already has to handle that case, as it is in fact the majority
> case RIGHT NOW. Users can hit this bug, in production, because unit tests
> might not have experienced it! The time component should be the time that the
> command was processed by the coordinator node.
>
>  would one expect a java/py/bash script that loops
>
> Individual Cassandra writes (which is what OP is referring to
> specifically) are not loops. They are in almost every case atomic
> operations that either succeed completely or fail completely. Allowing a
> single atomic operation to witness multiple times in these corner cases is
> not only surprising, as this thread demonstrates, it also needlessly
> restricts what developers can use the database for, and provides NO
> BENEFIT.
>
> Calling now PRIOR to initiating multiple inserts is in most cases
> exactly what one does...the ONLY practice is to set the value before
> initiating the sequence of calls
>
> Also false. Cassandra does not have a way of doing this on the coordinator
> node rather than the client device, and as I already showed, the client
> device is the wrong place to do it in situations where guaranteeing bounded
> clock-skew actually makes a difference one way or the other.
>
> Thanks,
> Cody
>
>
>
> On Wed, Nov 30, 2016 at 8:02 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> This is not a bug, and in fact changing it would be a serious bug.
>>
>> What it is is a wonderful case of bad coding: would one expect a
>> java/py/bash script that loops on a bunch of read/execute/update calls where
>> each iteration calls time to return the same exact time for the duration of
>> the execution of the code? Whether the code runs for 5 seconds or 5 hours?
>>
>> Every call to a system call is unique, including within C*. Calling now
>> PRIOR to initiating multiple inserts is in most cases exactly what one does
>> to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
>> identical system time as would be the uuid of the row, one tries to call
>> time as close to just before the insert as possible. Then repeat.
>>
>> You have a logic issue in your code. If you want the same value for a set
>> of calls, the ONLY practice is to set the value before initiating the
>> sequence of calls.
>>
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London
>> (+44) (0) 20 8144 9872*
>>
>> On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:
>>
>>> Getting the same TimeUUID values might be a major problem. Getting two
>>> different TimeUUIDs that at least have the same time component would not be a major
>>> problem as this is the main case today. Getting different time components
>>> is actually the corner case, and it is a corner case that breaks
>>> Internet-of-Things applications. We can tightly control clock skew in our
>>> cluster. We most definitely CANNOT control clock skew on the thousands of
>>> sensors that write to our cluster.
>>>
>>> Thanks,
>>> Cody
>>>
>>> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>>>
>>>> In my opinion, this is not broken and “fixing” it would break existing
>>>> code. Consider a batch that includes multiple inserts, each of which
>>>> inserts the value returned by now(). Getting the same UUID for each insert
>>>> would be a major problem.
>>>>
>>>> Cheers
>>>>
>>>> Robert
>>>>
>>>>
>>>> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com>
>>>> wrote:
>>>>
>>>> FWIW I'd suggest opening a bug--this behavior is certainly quite
>>>> unexpected and more than just a documentation issue. In general I can't
>>>> imagine any desirable properties of the current implementation, and there
>>>> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>>>>
>>>> Todd
>>>>
>>>> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>>>>
>>>>> Sorry for my typo. Obviously, I meant:
>>>>> "It appears that a single query that calls Cassandra's `now()` time
>>>>> function *multiple times* may actually cause a query to write or return
>>>>> different times."

Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Cody Yancey
This is not a bug, and in fact changing it would be a serious bug.

False. Absolutely no consumer would be broken by a change to guarantee an
identical time component that isn't broken already, for the simple reason
your code already has to handle that case, as it is in fact the majority
case RIGHT NOW. Users can hit this bug, in production, because unit tests
might not have experienced it! The time component should be the time that the
command was processed by the coordinator node.

 would one expect a java/py/bash script that loops

Individual Cassandra writes (which is what OP is referring to specifically)
are not loops. They are in almost every case atomic operations that either
succeed completely or fail completely. Allowing a single atomic operation
to witness multiple times in these corner cases is not only surprising, as
this thread demonstrates, it is also needlessly restricting to what
developers can use the database for, and provides NO BENEFIT.

Calling now PRIOR to initiating multiple inserts is in most cases
exactly what one does...the ONLY practice is to set the value before
initiating the sequence of calls

Also false. Cassandra does not have a way of doing this on the coordinator
node rather than the client device, and as I already showed, the client
device is the wrong place to do it in situations where guaranteeing bounded
clock-skew actually makes a difference one way or the other.

Thanks,
Cody



On Wed, Nov 30, 2016 at 8:02 PM, daemeon reiydelle <daeme...@gmail.com>
wrote:

> This is not a bug, and in fact changing it would be a serious bug.
>
> What it is is a wonderful case of bad coding: would one expect a
> java/py/bash script that loops on a bunch of read/execute/update calls where
> each iteration calls time to return the same exact time for the duration of
> the execution of the code? Whether the code runs for 5 seconds or 5 hours?
>
> Every call to a system call is unique, including within C*. Calling now
> PRIOR to initiating multiple inserts is in most cases exactly what one does
> to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
> identical system time as would be the uuid of the row, one tries to call
> time as close to just before the insert as possible. Then repeat.
>
> You have a logic issue in your code. If you want the same value for a set
> of calls, the ONLY practice is to set the value before initiating the
> sequence of calls.
>
>
>
> *...*
>
>
>
> *Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London
> (+44) (0) 20 8144 9872*
>
> On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:
>
>> Getting the same TimeUUID values might be a major problem. Getting two
>> different TimeUUIDs that at least have the same time component would not be a major
>> problem as this is the main case today. Getting different time components
>> is actually the corner case, and it is a corner case that breaks
>> Internet-of-Things applications. We can tightly control clock skew in our
>> cluster. We most definitely CANNOT control clock skew on the thousands of
>> sensors that write to our cluster.
>>
>> Thanks,
>> Cody
>>
>> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>>
>>> In my opinion, this is not broken and “fixing” it would break existing
>>> code. Consider a batch that includes multiple inserts, each of which
>>> inserts the value returned by now(). Getting the same UUID for each insert
>>> would be a major problem.
>>>
>>> Cheers
>>>
>>> Robert
>>>
>>>
>>> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com>
>>> wrote:
>>>
>>> FWIW I'd suggest opening a bug--this behavior is certainly quite
>>> unexpected and more than just a documentation issue. In general I can't
>>> imagine any desirable properties of the current implementation, and there
>>> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>>>
>>> Todd
>>>
>>> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>>>
>>>> Sorry for my typo. Obviously, I meant:
>>>> "It appears that a single query that calls Cassandra's`now()` time
>>>> function *multiple times *may actually cause a query to write or
>>>> return different times."
>>>>
>>>> Less of a surprise now that I realize more about the implementation,
>>>> but I agree that more explicit documentation around when exactly the
>>>> "execution" of each now() statement happens and what impl

Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread daemeon reiydelle
This is not a bug, and in fact changing it would be a serious bug.

What it is is a wonderful case of bad coding: would one expect a
java/py/bash script that loops on a bunch of read/execute/update calls where
each iteration calls time to return the same exact time for the duration of
the execution of the code? Whether the code runs for 5 seconds or 5 hours?

Every call to a system call is unique, including within C*. Calling now
PRIOR to initiating multiple inserts is in most cases exactly what one does
to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
identical system time as would be the uuid of the row, one tries to call
time as close to just before the insert as possible. Then repeat.

You have a logic issue in your code. If you want the same value for a set
of calls, the ONLY practice is to set the value before initiating the
sequence of calls.



*...*



*Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London (+44) (0) 20 8144 9872*

On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey <yan...@uber.com> wrote:

> Getting the same TimeUUID values might be a major problem. Getting two
> different TimeUUIDs that at least have the same time component would not be a major
> problem as this is the main case today. Getting different time components
> is actually the corner case, and it is a corner case that breaks
> Internet-of-Things applications. We can tightly control clock skew in our
> cluster. We most definitely CANNOT control clock skew on the thousands of
> sensors that write to our cluster.
>
> Thanks,
> Cody
>
> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:
>
>> In my opinion, this is not broken and “fixing” it would break existing
>> code. Consider a batch that includes multiple inserts, each of which
>> inserts the value returned by now(). Getting the same UUID for each insert
>> would be a major problem.
>>
>> Cheers
>>
>> Robert
>>
>>
>> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com> wrote:
>>
>> FWIW I'd suggest opening a bug--this behavior is certainly quite
>> unexpected and more than just a documentation issue. In general I can't
>> imagine any desirable properties of the current implementation, and there
>> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>>
>> Todd
>>
>> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>>
>>> Sorry for my typo. Obviously, I meant:
>>> "It appears that a single query that calls Cassandra's`now()` time
>>> function *multiple times *may actually cause a query to write or return
>>> different times."
>>>
>>> Less of a surprise now that I realize more about the implementation, but
>>> I agree that more explicit documentation around when exactly the
>>> "execution" of each now() statement happens and what implications it has
>>> for the resulting timestamps would be helpful when running into this.
>>>
>>> Thanks for the quick responses!
>>>
>>> -Terry
>>>
>>>
>>>
>>> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com>
>>> wrote:
>>>
>>> every now() call in a statement is under the hood "replaced" with a newly
>>> generated uuid.
>>>
>>> It can happen that they belong to different milliseconds in time.
>>>
>>> If you need to have the same timestamps you need to set them on the client
>>> side.
>>>
>>>
>>> @msvaljek <https://twitter.com/msvaljek>
>>>
>>> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>>>
>>> It appears that a single query that calls Cassandra's `now()` time
>>> function may actually cause a query to write or return different times.
>>>
>>> Is this the expected or defined behavior, and if so, why does it behave
>>> like this rather than evaluating `now()` once across an entire statement?
>>>
>>> This really affects UPDATE statements but to test it more easily, you
>>> could try something like:
>>>
>>> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
>>> FROM keyspace.table
>>> LIMIT 100;
>>>
>>> If you run that a few times, you should eventually see that the
>>> timestamp returned moves onto the next millisecond mid-query.
>>>
>>> --
>>> *Software Engineer*
>>> Turnitin - http://www.turnitin.com
>>> t...@turnitin.com
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *Software Engineer*
>>> Turnitin - http://www.turnitin.com
>>> t...@turnitin.com
>>>
>>
>>
>


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Cody Yancey
Getting the same TimeUUID values might be a major problem. Getting two
different TimeUUIDs that at least have the same time component would not be a major
problem as this is the main case today. Getting different time components
is actually the corner case, and it is a corner case that breaks
Internet-of-Things applications. We can tightly control clock skew in our
cluster. We most definitely CANNOT control clock skew on the thousands of
sensors that write to our cluster.

Thanks,
Cody

On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille <rwi...@fold3.com> wrote:

> In my opinion, this is not broken and “fixing” it would break existing
> code. Consider a batch that includes multiple inserts, each of which
> inserts the value returned by now(). Getting the same UUID for each insert
> would be a major problem.
>
> Cheers
>
> Robert
>
>
> On Nov 30, 2016, at 4:46 PM, Todd Fast <t...@digitalexistence.com> wrote:
>
> FWIW I'd suggest opening a bug--this behavior is certainly quite
> unexpected and more than just a documentation issue. In general I can't
> imagine any desirable properties of the current implementation, and there
> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>
> Todd
>
> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:
>
>> Sorry for my typo. Obviously, I meant:
>> "It appears that a single query that calls Cassandra's`now()` time
>> function *multiple times *may actually cause a query to write or return
>> different times."
>>
>> Less of a surprise now that I realize more about the implementation, but
>> I agree that more explicit documentation around when exactly the
>> "execution" of each now() statement happens and what implications it has
>> for the resulting timestamps would be helpful when running into this.
>>
>> Thanks for the quick responses!
>>
>> -Terry
>>
>>
>>
>> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com>
>> wrote:
>>
>> every now() call in a statement is under the hood "replaced" with a newly
>> generated uuid.
>>
>> It can happen that they belong to different milliseconds in time.
>>
>> If you need to have the same timestamps you need to set them on the client
>> side.
>>
>>
>> @msvaljek <https://twitter.com/msvaljek>
>>
>> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>>
>> It appears that a single query that calls Cassandra's `now()` time
>> function may actually cause a query to write or return different times.
>>
>> Is this the expected or defined behavior, and if so, why does it behave
>> like this rather than evaluating `now()` once across an entire statement?
>>
>> This really affects UPDATE statements but to test it more easily, you
>> could try something like:
>>
>> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
>> FROM keyspace.table
>> LIMIT 100;
>>
>> If you run that a few times, you should eventually see that the timestamp
>> returned moves onto the next millisecond mid-query.
>>
>> --
>> *Software Engineer*
>> Turnitin - http://www.turnitin.com
>> t...@turnitin.com
>>
>>
>>
>>
>>
>> --
>> *Software Engineer*
>> Turnitin - http://www.turnitin.com
>> t...@turnitin.com
>>
>
>


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Robert Wille
In my opinion, this is not broken and “fixing” it would break existing code. 
Consider a batch that includes multiple inserts, each of which inserts the 
value returned by now(). Getting the same UUID for each insert would be a major 
problem.

Cheers

Robert

On Nov 30, 2016, at 4:46 PM, Todd Fast 
<t...@digitalexistence.com> wrote:

FWIW I'd suggest opening a bug--this behavior is certainly quite unexpected and 
more than just a documentation issue. In general I can't imagine any desirable 
properties of the current implementation, and there are likely a bunch of 
latent bugs sitting out there, so it should be fixed.

Todd

On Wed, Nov 30, 2016 at 12:37 PM Terry Liu 
<t...@turnitin.com> wrote:
Sorry for my typo. Obviously, I meant:
"It appears that a single query that calls Cassandra's`now()` time function 
multiple times may actually cause a query to write or return different times."

Less of a surprise now that I realize more about the implementation, but I 
agree that more explicit documentation around when exactly the "execution" of 
each now() statement happens and what implications it has for the resulting 
timestamps would be helpful when running into this.

Thanks for the quick responses!

-Terry



On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek 
<msval...@gmail.com> wrote:
every now() call in a statement is under the hood "replaced" with a newly 
generated uuid.

It can happen that they belong to different milliseconds in time.

If you need to have the same timestamps you need to set them on the client side.


@msvaljek <https://twitter.com/msvaljek>

2016-11-29 22:49 GMT+01:00 Terry Liu 
<t...@turnitin.com>:
It appears that a single query that calls Cassandra's `now()` time function may 
actually cause a query to write or return different times.

Is this the expected or defined behavior, and if so, why does it behave like 
this rather than evaluating `now()` once across an entire statement?

This really affects UPDATE statements but to test it more easily, you could try 
something like:

SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
FROM keyspace.table
LIMIT 100;

If you run that a few times, you should eventually see that the timestamp 
returned moves onto the next millisecond mid-query.

--
Software Engineer
Turnitin - http://www.turnitin.com
t...@turnitin.com




--
Software Engineer
Turnitin - http://www.turnitin.com
t...@turnitin.com



Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Todd Fast
FWIW I'd suggest opening a bug--this behavior is certainly quite unexpected
and more than just a documentation issue. In general I can't imagine any
desirable properties of the current implementation, and there are likely a
bunch of latent bugs sitting out there, so it should be fixed.

Todd

On Wed, Nov 30, 2016 at 12:37 PM Terry Liu <t...@turnitin.com> wrote:

> Sorry for my typo. Obviously, I meant:
> "It appears that a single query that calls Cassandra's`now()` time
> function *multiple times *may actually cause a query to write or return
> different times."
>
> Less of a surprise now that I realize more about the implementation, but I
> agree that more explicit documentation around when exactly the "execution"
> of each now() statement happens and what implications it has for the
> resulting timestamps would be helpful when running into this.
>
> Thanks for the quick responses!
>
> -Terry
>
>
>
> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com> wrote:
>
> every now() call in a statement is under the hood "replaced" with a newly
> generated uuid.
>
> It can happen that they belong to different milliseconds in time.
>
> If you need to have the same timestamps you need to set them on the client
> side.
>
>
> @msvaljek <https://twitter.com/msvaljek>
>
> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>
> It appears that a single query that calls Cassandra's `now()` time
> function may actually cause a query to write or return different times.
>
> Is this the expected or defined behavior, and if so, why does it behave
> like this rather than evaluating `now()` once across an entire statement?
>
> This really affects UPDATE statements but to test it more easily, you
> could try something like:
>
> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
> FROM keyspace.table
> LIMIT 100;
>
> If you run that a few times, you should eventually see that the timestamp
> returned moves onto the next millisecond mid-query.
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>
>
>
>
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Terry Liu
Sorry for my typo. Obviously, I meant:
"It appears that a single query that calls Cassandra's`now()` time
function *multiple
times *may actually cause a query to write or return different times."

Less of a surprise now that I realize more about the implementation, but I
agree that more explicit documentation around when exactly the "execution"
of each now() statement happens and what implications it has for the
resulting timestamps would be helpful when running into this.

Thanks for the quick responses!

-Terry



On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek <msval...@gmail.com> wrote:

> every now() call in a statement is under the hood "replaced" with a newly
> generated uuid.
>
> It can happen that they belong to different milliseconds in time.
>
> If you need to have the same timestamps you need to set them on the client
> side.
>
>
> @msvaljek <https://twitter.com/msvaljek>
>
> 2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:
>
>> It appears that a single query that calls Cassandra's `now()` time
>> function may actually cause a query to write or return different times.
>>
>> Is this the expected or defined behavior, and if so, why does it behave
>> like this rather than evaluating `now()` once across an entire statement?
>>
>> This really affects UPDATE statements but to test it more easily, you
>> could try something like:
>>
>> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
>> FROM keyspace.table
>> LIMIT 100;
>>
>> If you run that a few times, you should eventually see that the timestamp
>> returned moves onto the next millisecond mid-query.
>>
>> --
>> *Software Engineer*
>> Turnitin - http://www.turnitin.com
>> t...@turnitin.com
>>
>
>


-- 
*Software Engineer*
Turnitin - http://www.turnitin.com
t...@turnitin.com


Re: Why does `now()` produce different times within the same query?

2016-11-29 Thread Marko Švaljek
every now() call in a statement is under the hood "replaced" with a newly
generated uuid.

It can happen that they belong to different milliseconds in time.

If you need to have the same timestamps you need to set them on the client side.
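
To see how close two back-to-back timeuuids usually are, a small sketch
with the DataStax Java driver 3.x, whose UUIDs helper exposes the embedded
millisecond:

import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class TimeuuidMillis {
    public static void main(String[] args) {
        // Two timeuuids generated back to back usually share a millisecond
        // but occasionally straddle a boundary, which is exactly what
        // happens when one statement contains two now() calls.
        UUID a = UUIDs.timeBased();
        UUID b = UUIDs.timeBased();
        System.out.println(UUIDs.unixTimestamp(a) == UUIDs.unixTimestamp(b)
                ? "same millisecond" : "different milliseconds");
    }
}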


@msvaljek <https://twitter.com/msvaljek>

2016-11-29 22:49 GMT+01:00 Terry Liu <t...@turnitin.com>:

> It appears that a single query that calls Cassandra's `now()` time
> function may actually cause a query to write or return different times.
>
> Is this the expected or defined behavior, and if so, why does it behave
> like this rather than evaluating `now()` once across an entire statement?
>
> This really affects UPDATE statements but to test it more easily, you
> could try something like:
>
> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
> FROM keyspace.table
> LIMIT 100;
>
> If you run that a few times, you should eventually see that the timestamp
> returned moves onto the next millisecond mid-query.
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>


Re: Why does `now()` produce different times within the same query?

2016-11-29 Thread Ariel Weisberg
Hi,



The function is defined here[1]. I hope my email client isn't
butchering the code.


public static final Function nowFct = new NativeScalarFunction("now", TimeUUIDType.instance)
{
    public ByteBuffer execute(ProtocolVersion protocolVersion, List<ByteBuffer> parameters)
    {
        return ByteBuffer.wrap(UUIDGen.getTimeUUIDBytes());
    }
};




It's documented as
http://cassandra.apache.org/doc/latest/cql/functions.html#timeuuid-functions:
> The now function takes no arguments and generates, on the coordinator
> node, a new unique timeuuid (at the time where the statement using it
> is executed).

Now, is the behavior consistent with the documentation? Well, it depends
on how you define statement (a CQL statement, or a function call) I
suppose. I do think the doc needs to be updated because in terms of
the principle of least surprise, yes, this is a little surprising.


I know of a couple of systems that associate a timestamp with each
transaction and will always return the same time when you request the
current time. However, you aren't requesting the current time; you are
requesting a UUID using a function named now(). I think we are stuck
with the behavior and need an additional function that does what one would
expect from a function named now().


As a workaround you can pass the time in as a parameter and then you
can guarantee it will be the same in each position.
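
A small sketch of that approach with the DataStax Java driver 3.x (the
contact point, keyspace, table and columns are made up for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class BindOnce {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {
            PreparedStatement ps = session.prepare(
                    "INSERT INTO events (id, created_at) VALUES (?, ?)");
            // Compute the value once, then bind the same instance in every
            // position; both columns then carry an identical timeuuid.
            UUID ts = UUIDs.timeBased();
            session.execute(ps.bind(ts, ts));
        }
    }
}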


There is also the implicit
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWritetime.html for
each column. Writetime didn't seem to have hits in the Apache docs so I
linked to the Datastax docs. I'll see about getting them updated.


Regards,

Ariel



On Tue, Nov 29, 2016, at 04:49 PM, Terry Liu wrote:

> It appears that a single query that calls Cassandra's `now()`
> time function may actually cause a query to write or return
> different times.
> 

> Is this the expected or defined behavior, and if so, why does it
> behave like this rather than evaluating `now()` once across an entire
> statement?
> 

> This really affects UPDATE statements but to test it more easily, you
> could try something like:
> 

> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b

> FROM keyspace.table

> LIMIT 100;

> 

> If you run that a few times, you should eventually see that the
> timestamp returned moves onto the next millisecond mid-query.
> 

> -- 

> *Software Engineer*

> Turnitin - http://www.turnitin.com[2]

> t...@turnitin.com




Links:

  1. 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/functions/TimeFcts.java#L54
  2. http://www.turnitin.com/


Why does `now()` produce different times within the same query?

2016-11-29 Thread Terry Liu
It appears that a single query that calls Cassandra's `now()` time function
may actually cause a query to write or return different times.

Is this the expected or defined behavior, and if so, why does it behave
like this rather than evaluating `now()` once across an entire statement?

This really affects UPDATE statements but to test it more easily, you could
try something like:

SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
FROM keyspace.table
LIMIT 100;

If you run that a few times, you should eventually see that the timestamp
returned moves onto the next millisecond mid-query.
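
For anyone reproducing this from application code, a minimal sketch with
the DataStax Java driver 3.x (assuming a reachable node; keyspace.table is
a placeholder, as in the query above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class NowDrift {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            ResultSet rs = session.execute(
                    "SELECT toTimestamp(now()) AS a, toTimestamp(now()) AS b "
                  + "FROM keyspace.table LIMIT 100");
            // Flag any row where the two now() evaluations landed in
            // different milliseconds.
            for (Row row : rs) {
                if (!row.getTimestamp("a").equals(row.getTimestamp("b")))
                    System.out.println(row.getTimestamp("a") + " != "
                            + row.getTimestamp("b"));
            }
        }
    }
}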

-- 
*Software Engineer*
Turnitin - http://www.turnitin.com
t...@turnitin.com


"java.io.IOError: java.io.EOFException: EOF after 13889 bytes out of 460861" occured when I query from a table

2016-10-31 Thread ????/??????
Hi, all
I have a problem. I created a table named "tblA" in C* and created a 
materialized view named "viewA" on tblA. I run a Spark job to process data from 
'viewA'.
In the beginning, it worked well. But the next day, the Spark job failed. 
And when I select data from 'viewA' and 'tblA' using CQL, it throws the 
following exception.
query from viewA:
 "ServerError: "
and query from tblA:
 "ServerError: "


My system version is :
Cassandra 3.7  +   spark1.6.2   +  Spark Cassandra Connector 1.6


Does anyone know about this problem? I look forward to your reply.


Thanks

Re: Cassandra failure during read query at consistency QUORUM (2 responses were required but only 0 replica responded, 2 failed)

2016-10-30 Thread Denis Mikhaylov
Why does it prevent quorum? How do we fix it?

> On 28 Oct 2016, at 16:02, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> 
> This looks like another case of an assert bubbling up through a try/catch 
> that doesn't catch AssertionError
> 
> On Fri, Oct 28, 2016 at 6:30 AM, Denis Mikhaylov <notxc...@gmail.com> wrote:
> Hi!
> 
> We’re running Cassandra 3.9
> 
> On the application side I see failed reads with this exception 
> com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure 
> during read query at consistency QUORUM (2 responses were required but only 0 
> replica responded, 2 failed)
> 
> On the server side we see:
> 
> WARN  [SharedPool-Worker-3] 2016-10-28 13:28:22,965 
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-3,5,
> main]: {}
> java.lang.AssertionError: null
> at org.apache.cassandra.db.rows.BTreeRow.getCell(BTreeRow.java:212) 
> ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.canRemoveRow(SinglePartitionReadCommand.java:899)
>  ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.reduceFilter(SinglePartitionReadCommand.java:863)
>  ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndSSTablesInTimestampOrder(SinglePartitionReadCommand.java:748)
>  ~[apache-cassan
> dra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:519)
>  ~[apache-cassandra-3.7.jar:
> 3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:496)
>  ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryStorage(SinglePartitionReadCommand.java:358)
>  ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:366) 
> ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:48)
>  ~[apache-cassandra-3.7.jar:3.7]
> at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) 
> ~[apache-cassandra-3.7.jar:3.7]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_102]
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
>  ~[apache-cassandra-
> 3.7.jar:3.7]
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
>  [apache
> -cassandra-3.7.jar:3.7]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
> [apache-cassandra-3.7.jar:3.7]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
> 
> It’s only affect single table. Sadly both on test (3.9) and production (3.7) 
> deployments of cassandra.
> 
> What could be the problem? Please help.
> 



Re: Cassandra failure during read query at consistency QUORUM (2 responses were required but only 0 replica responded, 2 failed)

2016-10-28 Thread Edward Capriolo
This looks like another case of an assert bubbling up through a try/catch
that doesn't catch AssertionError

On Fri, Oct 28, 2016 at 6:30 AM, Denis Mikhaylov <notxc...@gmail.com> wrote:

> Hi!
>
> We’re running Cassandra 3.9
>
> On the application side I see failed reads with this exception
> com.datastax.driver.core.exceptions.ReadFailureException: Cassandra
> failure during read query at consistency QUORUM (2 responses were required
> but only 0 replica responded, 2 failed)
>
> On the server side we see:
>
> WARN  [SharedPool-Worker-3] 2016-10-28 13:28:22,965 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-3,5,main]: {}
> java.lang.AssertionError: null
> at org.apache.cassandra.db.rows.BTreeRow.getCell(BTreeRow.java:212) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.canRemoveRow(SinglePartitionReadCommand.java:899) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.reduceFilter(SinglePartitionReadCommand.java:863) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndSSTablesInTimestampOrder(SinglePartitionReadCommand.java:748) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:519) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:496) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.SinglePartitionReadCommand.queryStorage(SinglePartitionReadCommand.java:358) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:366) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:48) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.7.jar:3.7]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_102]
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136) [apache-cassandra-3.7.jar:3.7]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-3.7.jar:3.7]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
>
> It only affects a single table. Sadly this happens on both test (3.9) and
> production (3.7) deployments of Cassandra.
>
> What could be the problem? Please help.


Cassandra failure during read query at consistency QUORUM (2 responses were required but only 0 replica responded, 2 failed)

2016-10-28 Thread Denis Mikhaylov
Hi!

We’re running Cassandra 3.9

On the application side I see failed reads with this exception 
com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure 
during read query at consistency QUORUM (2 responses were required but only 0 
replica responded, 2 failed)

On the server side we see:

WARN  [SharedPool-Worker-3] 2016-10-28 13:28:22,965 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-3,5,main]: {}
java.lang.AssertionError: null
at org.apache.cassandra.db.rows.BTreeRow.getCell(BTreeRow.java:212) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.canRemoveRow(SinglePartitionReadCommand.java:899) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.reduceFilter(SinglePartitionReadCommand.java:863) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndSSTablesInTimestampOrder(SinglePartitionReadCommand.java:748) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:519) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:496) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.SinglePartitionReadCommand.queryStorage(SinglePartitionReadCommand.java:358) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:366) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:48) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.7.jar:3.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_102]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-3.7.jar:3.7]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]

It only affects a single table. Sadly this happens on both test (3.9) and
production (3.7) deployments of Cassandra.

What could be the problem? Please help.

Re: Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-27 Thread DuyHai Doan
https://issues.apache.org/jira/browse/CASSANDRA-12654

On Thu, Oct 27, 2016 at 9:59 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I have the following table schema:
>
> *CREATE TABLE ticket_by_member (*
> * project_id text,*
> * member_id text,*
> * ticket_id text,*
> * ticket ticket,*
> *assigned_members list,*
> * votes list<FROZEN>,*
> *labels list<FROZEN>,*
> * PRIMARY KEY ( project_id, member_id, ticket_id )*
> *);*
>
> I have a scenario where I need to show all tickets for a particular
> project, by a group of member ids.
>
> I think it would be more efficient to do this as an IN query of the type: 
> *project_id
> = x AND member_id IN (...)*, instead of doing multiple queries of: *project_id
> = x AND member_id = y*
>
> I tried to set up an accessor for this, like the following:
>
> *   @Query("SELECT * FROM ticket_by_member WHERE project_id = ? AND
> member_id IN(?)" )*
>
> *Result cardsByMembers(String projectId,
> List memberIds);*
>
> But when I call this method, I get the exception:
>
>  java.util.concurrent.ExecutionException: com.datastax.driver.core.
> exceptions.InvalidQueryException: Cannot restrict clustering columns by
> IN relations when a collection is selected by the query
>
> Any ideas on why this isn't working?
>


Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-27 Thread Ali Akhtar
I have the following table schema:

*CREATE TABLE ticket_by_member (*
* project_id text,*
* member_id text,*
* ticket_id text,*
* ticket ticket,*
*assigned_members list,*
* votes list<FROZEN>,*
*labels list<FROZEN>,*
* PRIMARY KEY ( project_id, member_id, ticket_id )*
*);*

I have a scenario where I need to show all tickets for a particular
project, by a group of member ids.

I think it would be more efficient to do this as an IN query of the
type: *project_id
= x AND member_id IN (...)*, instead of doing multiple queries of: *project_id
= x AND member_id = y*

I tried to set up an accessor for this, like the following:

*   @Query("SELECT * FROM ticket_by_member WHERE project_id = ? AND
member_id IN(?)" )*

*Result cardsByMembers(String projectId,
List memberIds);*

But when I call this method, I get the exception:

 java.util.concurrent.ExecutionException:
com.datastax.driver.core.exceptions.InvalidQueryException: Cannot restrict
clustering columns by IN relations when a collection is selected by the
query

Any ideas on why this isn't working?
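
One workaround until CASSANDRA-12654 lands is to fan out one query per
member_id and merge the rows client-side. A minimal sketch, assuming the
DataStax Java driver 3.x (class and method names here are illustrative only):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.Futures;
import java.util.ArrayList;
import java.util.List;

final class TicketsByMembers {
    // One query per member_id, issued concurrently; results merged client-side.
    // In real code, prepare the statement once and reuse it.
    static List<Row> fetch(Session session, String projectId, List<String> memberIds) {
        PreparedStatement ps = session.prepare(
                "SELECT * FROM ticket_by_member WHERE project_id = ? AND member_id = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String memberId : memberIds) {
            futures.add(session.executeAsync(ps.bind(projectId, memberId)));
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSet rs : Futures.getUnchecked(Futures.allAsList(futures))) {
            for (Row row : rs) {
                rows.add(row);
            }
        }
        return rows;
    }
}

Because the queries run in parallel, the overall latency is usually close to
that of the slowest single-partition read rather than the sum of all of them.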


Re: Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-23 Thread Samba
please see CASSANDRA-12654

On Sat, Oct 22, 2016 at 3:12 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> So the commit on this restriction dates back to 2.2.0 (CASSANDRA-7981).
>
> Maybe Benjamin Lerer can shed some light on it.
>
> On Fri, Oct 21, 2016 at 11:05 PM, Jeff Carpenter <
> jeff.carpen...@choicehotels.com> wrote:
>
>> Hello
>>
>> Consider the following schema:
>>
>> CREATE TABLE rates_by_code (
>>   hotel_id text,
>>   rate_code text,
>>   rates set,
>>   description text,
>>   PRIMARY KEY ((hotel_id), rate_code)
>> );
>>
>> When executing the query:
>>
>> select rates from rates_by_code where hotel_id='AZ123' and rate_code IN
>> ('ABC', 'DEF', 'GHI');
>>
>> I receive the response message:
>>
>> Cannot restrict clustering columns by IN relations when a collection is
>> selected by the query.
>>
>> If I select a non-collection column such as "description", no error
>> occurs.
>>
>> Why does this restriction exist? Is this a restriction that is still
>> necessary given the new storage engine? (I have verified this on both 2.2.5
>> and 3.0.9.)
>>
>> I looked for a Jira issue related to this topic, but nothing obvious
>> popped up. I'd be happy to create one, though.
>>
>> Thanks
>> Jeff Carpenter
>>
>>
>>
>>
>


Re: Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-21 Thread DuyHai Doan
So the commit on this restriction dates back to 2.2.0 (CASSANDRA-7981).

Maybe Benjamin Lerer can shed some light on it.

On Fri, Oct 21, 2016 at 11:05 PM, Jeff Carpenter <
jeff.carpen...@choicehotels.com> wrote:

> Hello
>
> Consider the following schema:
>
> CREATE TABLE rates_by_code (
>   hotel_id text,
>   rate_code text,
>   rates set,
>   description text,
>   PRIMARY KEY ((hotel_id), rate_code)
> );
>
> When executing the query:
>
> select rates from rates_by_code where hotel_id='AZ123' and rate_code IN
> ('ABC', 'DEF', 'GHI');
>
> I receive the response message:
>
> Cannot restrict clustering columns by IN relations when a collection is
> selected by the query.
>
> If I select a non-collection column such as "description", no error occurs.
>
> Why does this restriction exist? Is this a restriction that is still
> necessary given the new storage engine? (I have verified this on both 2.2.5
> and 3.0.9.)
>
> I looked for a Jira issue related to this topic, but nothing obvious
> popped up. I'd be happy to create one, though.
>
> Thanks
> Jeff Carpenter
>
>
>
>


Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-21 Thread Jeff Carpenter
Hello

Consider the following schema:

CREATE TABLE rates_by_code (
  hotel_id text,
  rate_code text,
  rates set,
  description text,
  PRIMARY KEY ((hotel_id), rate_code)
);

When executing the query:

select rates from rates_by_code where hotel_id='AZ123' and rate_code IN ('ABC', 
'DEF', 'GHI');

I receive the response message:

Cannot restrict clustering columns by IN relations when a collection is 
selected by the query.

If I select a non-collection column such as "description", no error occurs.

Why does this restriction exist? Is this a restriction that is still necessary 
given the new storage engine? (I have verified this on both 2.2.5 and 3.0.9.)

I looked for a Jira issue related to this topic, but nothing obvious popped up. 
I'd be happy to create one, though.

Thanks
Jeff Carpenter
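
One workaround that sidesteps the restriction entirely is to drop the IN and
read the whole (typically small) partition, filtering the rate codes
client-side. A rough sketch against the DataStax Java driver 3.x (names are
illustrative; the element type of rates is left opaque on purpose, since it is
not shown above):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Arrays;
import java.util.List;

final class RatesByCode {
    // Restrict only the partition key (no IN on the clustering column),
    // then keep just the wanted rate codes in the application.
    static void printRates(Session session) {
        List<String> wanted = Arrays.asList("ABC", "DEF", "GHI");
        ResultSet rs = session.execute(
                "SELECT rate_code, rates FROM rates_by_code WHERE hotel_id = ?", "AZ123");
        for (Row row : rs) {
            if (wanted.contains(row.getString("rate_code"))) {
                System.out.println(row.getString("rate_code") + " -> " + row.getObject("rates"));
            }
        }
    }
}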




Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Justin Cameron
I'm not sure about using it in a SimpleStatement in the Java driver (you
might need to test this), but the QueryBuilder does have support for in()
where you pass a list as the parameter:

see
http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/querybuilder/QueryBuilder.html#in-java.lang.String-java.util.List-
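
For example, a minimal sketch along those lines (table and column names
illustrative):

import static com.datastax.driver.core.querybuilder.QueryBuilder.eq;
import static com.datastax.driver.core.querybuilder.QueryBuilder.in;
import static com.datastax.driver.core.querybuilder.QueryBuilder.select;

import com.datastax.driver.core.Statement;
import java.util.Arrays;

final class InWithQueryBuilder {
    // Builds: SELECT * FROM my_table WHERE pk = 'test' AND ck IN (1, 2);
    static Statement statement() {
        return select().from("my_table")
                       .where(eq("pk", "test"))
                       .and(in("ck", Arrays.asList(1, 2)));
    }
}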


On Tue, 11 Oct 2016 at 07:24 Ali Akhtar <ali.rac...@gmail.com> wrote:

Justin,

I'm asking how to bind a parameter for IN queries thru the java driver.

On Tue, Oct 11, 2016 at 7:22 PM, Justin Cameron <jus...@instaclustr.com>
wrote:

You need to specify the values themselves.

CREATE TABLE user (
id int,
type text,
val1 int,
val2 text,
PRIMARY KEY ((id, type), val1, val2)
);

SELECT * FROM user WHERE id = 1 AND type IN ('user', 'admin') AND val1 = 3
AND val2 IN ('a', 'v', 'd');

On Tue, 11 Oct 2016 at 07:11 Ali Akhtar <ali.rac...@gmail.com> wrote:

Do you send the values themselves, or send them as an array / collection?
Or will both work?

On Tue, Oct 11, 2016 at 7:10 PM, Justin Cameron <jus...@instaclustr.com>
wrote:

You can pass multiple values to the IN clause, however they can only be
used on the last column in the partition key and/or the last column in the
full primary key.

Example:

'Select * from my_table WHERE pk = 'test' And ck IN (1, 2)'


On Tue, 11 Oct 2016 at 06:15 Ali Akhtar <ali.rac...@gmail.com> wrote:

If I wanted to create an accessor, and have a method which does a query
like this:

'Select * from my_table WHERE pk = ? And ck IN (?)'

And there were multiple options that could go inside the IN() query, how
can I specify that? Will it e.g, let me pass in an array as the 2nd
variable?

-- 

Justin Cameron

Senior Software Engineer | Instaclustr









-- 

Justin Cameron

Senior Software Engineer | Instaclustr









-- 

Justin Cameron

Senior Software Engineer | Instaclustr






Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Justin Cameron
You need to specify the values themselves.

CREATE TABLE user (
id int,
type text,
val1 int,
val2 text,
PRIMARY KEY ((id, type), val1, val2)
);

SELECT * FROM user WHERE id = 1 AND type IN ('user', 'admin') AND val1 = 3
AND val2 IN ('a', 'v', 'd');

On Tue, 11 Oct 2016 at 07:11 Ali Akhtar <ali.rac...@gmail.com> wrote:

Do you send the values themselves, or send them as an array / collection?
Or will both work?

On Tue, Oct 11, 2016 at 7:10 PM, Justin Cameron <jus...@instaclustr.com>
wrote:

You can pass multiple values to the IN clause, however they can only be
used on the last column in the partition key and/or the last column in the
full primary key.

Example:

'Select * from my_table WHERE pk = 'test' And ck IN (1, 2)'


On Tue, 11 Oct 2016 at 06:15 Ali Akhtar <ali.rac...@gmail.com> wrote:

If I wanted to create an accessor, and have a method which does a query
like this:

'Select * from my_table WHERE pk = ? And ck IN (?)'

And there were multiple options that could go inside the IN() query, how
can I specify that? Will it e.g, let me pass in an array as the 2nd
variable?

-- 

Justin Cameron

Senior Software Engineer | Instaclustr









-- 

Justin Cameron

Senior Software Engineer | Instaclustr






Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Ali Akhtar
Justin,

I'm asking how to bind a parameter for IN queries thru the java driver.

On Tue, Oct 11, 2016 at 7:22 PM, Justin Cameron <jus...@instaclustr.com>
wrote:

> You need to specify the values themselves.
>
> CREATE TABLE user (
> id int,
> type text,
> val1 int,
> val2 text,
> PRIMARY KEY ((id, type), val1, val2)
> );
>
> SELECT * FROM user WHERE id = 1 AND type IN ('user', 'admin') AND val1 =
> 3 AND val2 IN ('a', 'v', 'd');
>
> On Tue, 11 Oct 2016 at 07:11 Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> Do you send the values themselves, or send them as an array / collection?
> Or will both work?
>
> On Tue, Oct 11, 2016 at 7:10 PM, Justin Cameron <jus...@instaclustr.com>
> wrote:
>
> You can pass multiple values to the IN clause, however they can only be
> used on the last column in the partition key and/or the last column in the
> full primary key.
>
> Example:
>
> 'Select * from my_table WHERE pk = 'test' And ck IN (1, 2)'
>
>
> On Tue, 11 Oct 2016 at 06:15 Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> If I wanted to create an accessor, and have a method which does a query
> like this:
>
> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>
> And there were multiple options that could go inside the IN() query, how
> can I specify that? Will it e.g, let me pass in an array as the 2nd
> variable?
>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
>
>
>
>
>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
>
>


Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Ali Akhtar
Ah, thanks, good catch.

If I send a List / Array as value for the last param, will that get bound
as expected?

On Tue, Oct 11, 2016 at 7:16 PM, horschi <hors...@gmail.com> wrote:

> Hi Ali,
>
> do you perhaps want "'Select * from my_table WHERE pk = ? And ck IN ?'" ?
> (Without the brackets around the question mark)
>
> regards,
> Ch
>
> On Tue, Oct 11, 2016 at 3:14 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
>> If I wanted to create an accessor, and have a method which does a query
>> like this:
>>
>> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>>
>> And there were multiple options that could go inside the IN() query, how
>> can I specify that? Will it e.g, let me pass in an array as the 2nd
>> variable?
>>
>
>


Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Ali Akhtar
Do you send the values themselves, or send them as an array / collection?
Or will both work?

On Tue, Oct 11, 2016 at 7:10 PM, Justin Cameron <jus...@instaclustr.com>
wrote:

> You can pass multiple values to the IN clause, however they can only be
> used on the last column in the partition key and/or the last column in the
> full primary key.
>
> Example:
>
> 'Select * from my_table WHERE pk = 'test' And ck IN (1, 2)'
>
>
> On Tue, 11 Oct 2016 at 06:15 Ali Akhtar <ali.rac...@gmail.com> wrote:
>
>> If I wanted to create an accessor, and have a method which does a query
>> like this:
>>
>> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>>
>> And there were multiple options that could go inside the IN() query, how
>> can I specify that? Will it e.g, let me pass in an array as the 2nd
>> variable?
>>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
> This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
> Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>


Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread horschi
Hi Ali,

do you perhaps want "'Select * from my_table WHERE pk = ? And ck IN ?'" ?
(Without the brackets around the question mark)

regards,
Ch

On Tue, Oct 11, 2016 at 3:14 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> If I wanted to create an accessor, and have a method which does a query
> like this:
>
> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>
> And there were multiple options that could go inside the IN() query, how
> can I specify that? Will it e.g, let me pass in an array as the 2nd
> variable?
>
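
Concretely, something like this minimal sketch (assuming the DataStax Java
driver 3.x; names illustrative), where the bound List supplies all the IN
values at execution time:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import java.util.Arrays;

final class BindInList {
    // Note "IN ?" with no parentheses: the whole bound List becomes the IN values.
    static ResultSet query(Session session) {
        PreparedStatement ps = session.prepare(
                "SELECT * FROM my_table WHERE pk = ? AND ck IN ?");
        BoundStatement bs = ps.bind("test", Arrays.asList(1, 2));
        return session.execute(bs);
    }
}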


Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Justin Cameron
You can pass multiple values to the IN clause, however they can only be
used on the last column in the partition key and/or the last column in the
full primary key.

Example:

'Select * from my_table WHERE pk = 'test' And ck IN (1, 2)'


On Tue, 11 Oct 2016 at 06:15 Ali Akhtar <ali.rac...@gmail.com> wrote:

> If I wanted to create an accessor, and have a method which does a query
> like this:
>
> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>
> And there were multiple options that could go inside the IN() query, how
> can I specify that? Will it e.g, let me pass in an array as the 2nd
> variable?
>
-- 

Justin Cameron

Senior Software Engineer | Instaclustr






Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread Ali Akhtar
If I wanted to create an accessor, and have a method which does a query
like this:

'Select * from my_table WHERE pk = ? And ck IN (?)'

And there were multiple options that could go inside the IN() query, how
can I specify that? Will it e.g, let me pass in an array as the 2nd
variable?


Re: Doing a calculation in a query?

2016-10-10 Thread DuyHai Doan
Assuming you're using Cassandra 3.0 or later, User Defined Functions (UDFs)
can help you compute the shipment_delay. As for the ordering: since this
column is computed and not a clustering column, ordering won't be possible
server-side.

More details about UDF: http://www.doanduyhai.com/blog/?p=1876
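
For the ordering part, that leaves the client. A rough sketch (DataStax Java
driver 3.x; names illustrative) that computes the delay from the fetched rows
and sorts longest-first:

import com.datastax.driver.core.Row;
import java.util.Comparator;
import java.util.Date;
import java.util.List;
import java.util.stream.Collectors;

final class ShipmentDelay {
    // shipment_delay = shipped_at - ordered_at, or now - ordered_at if unshipped.
    static long delayMillis(Row row) {
        Date orderedAt = row.getTimestamp("ordered_at");
        Date shippedAt = row.getTimestamp("shipped_at"); // null while unshipped
        long end = (shippedAt != null) ? shippedAt.getTime() : System.currentTimeMillis();
        return end - orderedAt.getTime();
    }

    // Longest delay first, since the server cannot ORDER BY a computed column.
    static List<Row> sortByDelayDesc(List<Row> orders) {
        return orders.stream()
                     .sorted(Comparator.comparingLong(ShipmentDelay::delayMillis).reversed())
                     .collect(Collectors.toList());
    }
}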

On Mon, Oct 10, 2016 at 6:08 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I have a table for tracking orders. Each order has an `ordered_at` field
> (can be a timestamp, or a long with the milliseconds of the timestamp) and
> `shipped_at` field (ditto, timestamp or long).
>
> ordered_at tracks when the order was made.
>
> shipped_at tracks when the order was shipped.
>
> When retrieving the orders, I need to calculate an additional field,
> called 'shipment_delay'. This is simply, 'shipped_at - ordered_at`. I.e how
> long it took between when the order was made, and when it was shipped.
>
> The tricky part is that if an order isn't yet shipped, then it should
> just return how many days it has been since the order was made.
>
> E.g, if order was made on Jan 1 and shipped on Jan 5th, shipment_delay = 4
>  days (in milliseconds if needed)
>
> If order made on Jan 1, but not yet shipped, and today is Jan 10th, then
> shipment_delay = 10 days.
>
> I then need to sort the orders in the order of 'shipment_delay desc', i.e
> show the orders which took the longest, at the top.
>
> Is it possible to define 'shipment_delay' at the table or query level, so
> it can be used in the 'order by' clause, or if this ordering will have to
> be done myself after the data is received?
>
> Thanks.
>
>


Doing a calculation in a query?

2016-10-10 Thread Ali Akhtar
I have a table for tracking orders. Each order has an `ordered_at` field
(can be a timestamp, or a long with the milliseconds of the timestamp) and
`shipped_at` field (ditto, timestamp or long).

ordered_at tracks when the order was made.

shipped_at tracks when the order was shipped.

When retrieving the orders, I need to calculate an additional field, called
'shipment_delay'. This is simply, 'shipped_at - ordered_at`. I.e how long
it took between when the order was made, and when it was shipped.

The tricky part is that if an order isn't yet shipped, then it should just
return how many days it has been since the order was made.

E.g, if order was made on Jan 1 and shipped on Jan 5th, shipment_delay = 4
 days (in milliseconds if needed)

If order made on Jan 1, but not yet shipped, and today is Jan 10th, then
shipment_delay = 10 days.

I then need to sort the orders in the order of 'shipment_delay desc', i.e
show the orders which took the longest, at the top.

Is it possible to define 'shipment_delay' at the table or query level, so
it can be used in the 'order by' clause, or if this ordering will have to
be done myself after the data is received?

Thanks.


Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-10-04 Thread Mikhail Krupitskiy
Please see my comments inline.

Thanks,
Mikhail
> On 26 Sep 2016, at 17:07, DuyHai Doan  wrote:
> 
> "In the current implementation (‘%’ could be a wildcard only at the start/end 
> of a term) I guess it should be ’ENDS with ‘%escape’ ‘." 
> 
> --> Yes in the current impl, it means ENDS WITH '%escape' but we want SASI to 
> understand the %% as an escape for % so the goal is that SASI understands 
> LIKE '%%escape' as EQUALS TO '%escape'. Am I correct ?
I guess that the goal is to define a way to use ‘%’ as a simple char.
LIKE '%escape' - ENDS WITH 'escape'
LIKE '%%escape' - EQUALS TO '%escape’
LIKE '%%escape%' - STARTS WITH '%escape’

LIKE ‘%%%escape’ - undefined in general case
LIKE ‘%%%escape’ - ENDS WITH “%escape” in the case where we know that a wildcard
can occur only at the start/end.
> 
> "Moreover all terms that contains single ‘%’ somewhere in the middle should 
> cause an exception."
> 
> --> Not necessarily, sometimes people may want to search for a text pattern 
> containing the literal %. Imagine the text "this year the average income has 
> increased by 10%". People may want to search for "10%”.
If someone wants to search for ’10%’ then they should escape the ‘%’ char,
e.g. LIKE “10%%”.
> 
> 
> 
> "BUT may be it’s better to make escaping more universal to support a future 
> possible case where a wildcard could be placed in the middle of a term too?"
> 
> --> I guess universal escaping for % is the cleaner and better solution. 
> However it may involve some complex regular expression. I'm not sure that 
> input.replaceAll("%%", "%") trick would work for any cases.
As I wrote, I don’t think it’s possible to do universal escaping using ‘%’ both
as the escape char (a char that escapes the wildcard char to make it a plain
char semantically) and as the wildcard at the same time.
I suggest to use “\” as an escape char.
Also I don’t know enough about Cassandra’s internals to estimate how universal 
escaping will affect performance.
It really looks like a better solution for Cassandra users.
> 
> And we also need to define when we want to detect operation type 
> (LIKE_PREFIX, LIKE_SUFFIX, LIKE_CONTAINS, EQUAL) ? 
> 
> Should we detect operation type BEFORE escaping or AFTER escaping ?
As I understand it, the escaping will be done by users.
So at the DB level we get an already-escaped string from the request, and it’s
possible to know which symbol is a wildcard and which is just a plain char.
I guess Cassandra should parse (unescape) the incoming string to determine the
wildcard positions and the plain-char positions, and then determine the
operation type.
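
To make that concrete, a small illustrative Java sketch (not Cassandra's actual
parser) of one reading of the proposal: ‘\’ escapes the next char, and a single
unescaped ‘%’ is allowed at each end. Patterns like ‘%%%escape’ from the thread
would need an extra collapsing rule on top of this:

final class LikePatternParser {
    enum Op { LIKE_PREFIX, LIKE_SUFFIX, LIKE_CONTAINS, EQUAL }

    static final class Parsed {
        final Op op;
        final String term; // literal term with escapes removed
        Parsed(Op op, String term) { this.op = op; this.term = term; }
    }

    static Parsed parse(String pattern) {
        StringBuilder term = new StringBuilder();
        boolean leading = false, trailing = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                term.append(pattern.charAt(++i));   // escaped literal char
            } else if (c == '%' && i == 0) {
                leading = true;                     // wildcard at start
            } else if (c == '%' && i == pattern.length() - 1) {
                trailing = true;                    // wildcard at end
            } else if (c == '%') {
                throw new IllegalArgumentException("wildcard only allowed at start/end");
            } else {
                term.append(c);
            }
        }
        Op op = leading && trailing ? Op.LIKE_CONTAINS
              : leading ? Op.LIKE_SUFFIX
              : trailing ? Op.LIKE_PREFIX
              : Op.EQUAL;
        return new Parsed(op, term.toString());
    }
}

With this, the pattern ‘\%foo%’ parses as LIKE_PREFIX with term ‘%foo’, and
‘\\\%%’ as LIKE_PREFIX with term ‘\%’, matching the examples quoted further
down in the thread.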

 
> 
> 
> 
> 
> 
> On Mon, Sep 26, 2016 at 3:54 PM, Mikhail Krupitskiy 
> > 
> wrote:
>> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
> In the current implementation (‘%’ could be a wildcard only at the start/end 
> of a term) I guess it should be ’ENDS with ‘%escape’ ‘.
> Moreover all terms that contains single ‘%’ somewhere in the middle should 
> cause an exception.
> BUT may be it’s better to make escaping more universal to support a future 
> possible case where a wildcard could be placed in the middle of a term too?
> 
> Thanks,
> Mikhail 
>> On 24 Sep 2016, at 21:09, DuyHai Doan > > wrote:
>> 
>> Reminder, right now, the % character is only interpreted as wildcard IF AND 
>> ONLY IF it is the first/last character of the searched term
>> 
>> 
>> LIKE '%escape' --> ENDS WITH 'escape' 
>> 
>> If we use % to escape %,
>> LIKE '%%escape' -->  EQUALS TO '%escape'
>> 
>> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
>> 
>> 
>> 
>> 
>> On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy 
>> > 
>> wrote:
>> Hi, Jim,
>> 
>> What pattern should be used to search “ends with  ‘%escape’ “ with your 
>> conception?
>> 
>> Thanks,
>> Mikhail
>> 
>>> On 22 Sep 2016, at 18:51, Jim Ancona >> > wrote:
>>> 
>>> To answer DuyHai's question without introducing new syntax, I'd suggest:
 LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape' 
>>> So the first two %'s are translated to a literal, non-wildcard % and the 
>>> third % is a wildcard because it's not doubled.
>>> 
>>> Jim
>>> 
>>> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy 
>>> >> > wrote:
>>> I guess that it should be similar to how it is done in SQL for LIKE 
>>> patterns.
>>> 
>>> You can introduce an escape character, e.g. ‘\’.
>>> Examples:
>>> ‘%’ - any string
>>> ‘\%’ - equal to ‘%’ character
>>> ‘\%foo%’ - starts from ‘%foo’
>>> ‘%%%escape’ - ends with ’escape’
>>> ‘\%%’ - starts from ‘%’
>>> ‘\\\%%’ - starts from ‘\%’ .
>>> 
>>> What do you think?
>>> 
>>> Thanks,
>>> Mikhail
 On 22 Sep 2016, at 16:47, DuyHai Doan 

Contains-query leads to error when list in selected row is empty

2016-09-28 Thread Michael Mirwaldt
Hi Cassandra-users,
my name is Michael Mirwaldt and I work for financial.com.

I have encountered this problem with Cassandra 3.7 running on 4 nodes:

Given the data model

CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '2'}  AND durable_writes = true;

CREATE TABLE mykeyspace.mytable (partitionkey text, mylist list<text>, PRIMARY 
KEY (partitionkey));

If I add the value

INSERT INTO mykeyspace.mytable(partitionkey,mylist) VALUES('A',['1']);

and query

select * from mykeyspace.mytable;

I get

 partitionkey | mylist
--------------+--------
            A |  ['1']

and If I query

select * from mykeyspace.mytable where partitionkey='A' and mylist contains '1' 
allow filtering;

I get

 partitionkey | mylist
--------------+--------
            A |  ['1']


But if I add

INSERT INTO mykeyspace.mytable(partitionkey) VALUES('B');

so that

select * from mykeyspace.mytable;

gives me

 partitionkey | mylist
--------------+--------
            B |   null
            A |  ['1']

then

select * from mykeyspace.mytable where partitionkey='B' and mylist contains '2' 
allow filtering;

leads to the error message

ReadFailure: code=1300 [Replica(s) failed to execute read] message="Operation 
failed - received 0 responses and 2 failures" info={'failures': 2, 
'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

with the log message

[...] AwareExecutorService$FutureTask.run|Uncaught exception on thread 
Thread[SharedPool-Worker-2,5,main]: java.lang.RuntimeException: 
java.lang.NullPointerException

on one other node.

Is that really logical and intended?
Would you not expect an empty result for the last query?

I am confused.
Can you help me?
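
In the meantime, a workaround sketch (DataStax Java driver 3.x; names
illustrative) is to read by partition key alone and do the membership test
client-side, which avoids the CONTAINS + ALLOW FILTERING path on rows whose
list column is null:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.List;

final class ContainsWorkaround {
    // Fetch by partition key only, then test membership in the application.
    static boolean listContains(Session session, String partitionKey, String value) {
        Row row = session.execute(
                "SELECT mylist FROM mykeyspace.mytable WHERE partitionkey = ?",
                partitionKey).one();
        if (row == null || row.isNull("mylist")) {
            return false; // no row, or list column is null
        }
        List<String> list = row.getList("mylist", String.class);
        return list.contains(value);
    }
}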

Brgds,
Michael





Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-26 Thread DuyHai Doan
"In the current implementation (‘%’ could be a wildcard only at the
start/end of a term) I guess it should be ’ENDS with ‘%escape’ ‘."

--> Yes in the current impl, it means ENDS WITH '%escape' but we want SASI
to understand the %% as an escape for % so the goal is that SASI
understands LIKE '%%escape' as EQUALS TO '%escape'. Am I correct ?

"Moreover all terms that contains single ‘%’ somewhere in the middle should
cause an exception."

--> Not necessarily, sometimes people may want to search for a text pattern
containing the literal %. Imagine the text "this year the average income
has increased by 10%". People may want to search for "10%".


"BUT may be it’s better to make escaping more universal to support a future
possible case where a wildcard could be placed in the middle of a term too?"

--> I guess universal escaping for % is the cleaner and better solution.
However it may involve some complex regular expression. I'm not sure that
input.replaceAll("%%", "%") trick would work for any cases.

And we also need to define when we want to detect operation type
(LIKE_PREFIX, LIKE_SUFFIX, LIKE_CONTAINS, EQUAL) ?

Should we detect operation type BEFORE escaping or AFTER escaping ?





On Mon, Sep 26, 2016 at 3:54 PM, Mikhail Krupitskiy <
mikhail.krupits...@jetbrains.com> wrote:

> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
>
> In the current implementation (‘%’ could be a wildcard only at the
> start/end of a term) I guess it should be ’ENDS with ‘%escape’ ‘.
> Moreover all terms that contains single ‘%’ somewhere in the middle should
> cause an exception.
> BUT may be it’s better to make escaping more universal to support a future
> possible case where a wildcard could be placed in the middle of a term too?
>
> Thanks,
> Mikhail
>
> On 24 Sep 2016, at 21:09, DuyHai Doan  wrote:
>
> Reminder, right now, the % character is only interpreted as wildcard IF
> AND ONLY IF it is the first/last character of the searched term
>
>
> LIKE '%escape' --> ENDS WITH 'escape'
>
> If we use % to escape %,
> LIKE '%%escape' -->  EQUALS TO '%escape'
>
> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
>
>
>
>
> On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> Hi, Jim,
>>
>> What pattern should be used to search “ends with  ‘%escape’ “ with your
>> conception?
>>
>> Thanks,
>> Mikhail
>>
>> On 22 Sep 2016, at 18:51, Jim Ancona  wrote:
>>
>> To answer DuyHai's question without introducing new syntax, I'd suggest:
>>
>> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape'
>>
>> So the first two %'s are translated to a literal, non-wildcard % and the
>> third % is a wildcard because it's not doubled.
>>
>> Jim
>>
>> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>>> I guess that it should be similar to how it is done in SQL for LIKE
>>> patterns.
>>>
>>> You can introduce an escape character, e.g. ‘\’.
>>> Examples:
>>> ‘%’ - any string
>>> ‘\%’ - equal to ‘%’ character
>>> ‘\%foo%’ - starts from ‘%foo’
>>> ‘%%%escape’ - ends with ’escape’
>>> ‘\%%’ - starts from ‘%’
>>> ‘\\\%%’ - starts from ‘\%’ .
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Mikhail
>>>
>>> On 22 Sep 2016, at 16:47, DuyHai Doan  wrote:
>>>
>>> Hello Mikhail
>>>
>>> It's more complicated than it seems
>>>
>>> LIKE '%%escape' means  EQUAL TO '%escape'
>>>
>>> LIKE '%escape' means ENDS WITH 'escape'
>>>
>>> What about LIKE '%%%escape'?
>>>
>>> How should we treat this case ? Replace %% by % at the beginning of the
>>> searched term ??
>>>
>>>
>>>
>>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
 Hi!

 We’ve talked about two items:
 1) ‘%’ as a wildcard in the middle of LIKE pattern.
 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with
 help of LIKE.

 Item #1 was resolved as CASSANDRA-12573.

 Regarding item #2: you said the following:

 A possible fix would be:

 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending
 on the column data type)
 2) remove the escape character e.g. before parsing OR use some advanced
 regex to exclude the %% from parsing e.g

 Step 2) is dead easy but step 1) is harder because I don't know if
 converting the bytebuffer into String at this stage of the CQL parser is
 expensive or not (in terms of computation)

 Let me try a patch

 So is there any update on this?

 Thanks,
 Mikhail


 On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <
 mikhail.krupits...@jetbrains.com> wrote:

 Hi!

 Have you had a chance to try your patch or solve the issue in an other
 way?

 Thanks,
 Mikhail

 On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:

 Ok so I've found the source of the issue, 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-26 Thread Mikhail Krupitskiy
> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
In the current implementation (‘%’ could be a wildcard only at the start/end of 
a term) I guess it should be ’ENDS with ‘%escape’ ‘.
Moreover all terms that contains single ‘%’ somewhere in the middle should 
cause an exception.
BUT may be it’s better to make escaping more universal to support a future 
possible case where a wildcard could be placed in the middle of a term too?

Thanks,
Mikhail 
> On 24 Sep 2016, at 21:09, DuyHai Doan  wrote:
> 
> Reminder, right now, the % character is only interpreted as wildcard IF AND 
> ONLY IF it is the first/last character of the searched term
> 
> 
> LIKE '%escape' --> ENDS WITH 'escape' 
> 
> If we use % to escape %,
> LIKE '%%escape' -->  EQUALS TO '%escape'
> 
> LIKE '%%%escape' --> EQUALS TO '%%escape' ???
> 
> 
> 
> 
> On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy 
> > 
> wrote:
> Hi, Jim,
> 
> What pattern should be used to search “ends with  ‘%escape’ “ with your 
> conception?
> 
> Thanks,
> Mikhail
> 
>> On 22 Sep 2016, at 18:51, Jim Ancona > > wrote:
>> 
>> To answer DuyHai's question without introducing new syntax, I'd suggest:
>>> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape' 
>> So the first two %'s are translated to a literal, non-wildcard % and the 
>> third % is a wildcard because it's not doubled.
>> 
>> Jim
>> 
>> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy 
>> > 
>> wrote:
>> I guess that it should be similar to how it is done in SQL for LIKE patterns.
>> 
>> You can introduce an escape character, e.g. ‘\’.
>> Examples:
>> ‘%’ - any string
>> ‘\%’ - equal to ‘%’ character
>> ‘\%foo%’ - starts from ‘%foo’
>> ‘%%%escape’ - ends with ’escape’
>> ‘\%%’ - starts from ‘%’
>> ‘\\\%%’ - starts from ‘\%’ .
>> 
>> What do you think?
>> 
>> Thanks,
>> Mikhail
>>> On 22 Sep 2016, at 16:47, DuyHai Doan >> > wrote:
>>> 
>>> Hello Mikhail
>>> 
>>> It's more complicated than it seems
>>> 
>>> LIKE '%%escape' means  EQUAL TO '%escape'
>>> 
>>> LIKE '%escape' means ENDS WITH 'escape'
>>> 
>>> What about LIKE '%%%escape'?
>>> 
>>> How should we treat this case ? Replace %% by % at the beginning of the 
>>> searched term ??
>>> 
>>> 
>>> 
>>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy 
>>> >> > wrote:
>>> Hi!
>>> 
>>> We’ve talked about two items:
>>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with help 
>>> of LIKE.
>>> 
>>> Item #1 was resolved as CASSANDRA-12573.
>>> 
>>> Regarding item #2: you said the following:
 A possible fix would be:
 
 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on 
 the column data type)
 2) remove the escape character e.g. before parsing OR use some advanced 
 regex to exclude the %% from parsing e.g
 
 Step 2) is dead easy but step 1) is harder because I don't know if 
 converting the bytebuffer into String at this stage of the CQL parser is 
 expensive or not (in terms of computation)
 
 Let me try a patch 
>>> 
>>> So is there any update on this?
>>> 
>>> Thanks,
>>> Mikhail
>>> 
>>> 
 On 20 Sep 2016, at 18:38, Mikhail Krupitskiy 
 > wrote:
 
 Hi!
 
 Have you had a chance to try your patch or solve the issue in an other 
 way? 
 
 Thanks,
 Mikhail
> On 15 Sep 2016, at 16:02, DuyHai Doan  > wrote:
> 
> Ok so I've found the source of the issue, it's pretty well hidden because 
> it is NOT in the SASI source code directly.
> 
> Here is the method where C* determines what kind of LIKE expression 
> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
> 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>  
> 
> 
> As you can see, it's pretty simple, maybe too simple. Indeed, they forget 
> to remove escape character BEFORE doing the matching so if your search is 
> LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
> 
> A possible fix would be:
> 
> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on 
> the column data type)
> 2) remove the escape character e.g. before parsing OR use some advanced 
> regex to exclude the %% from 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-24 Thread DuyHai Doan
Reminder, right now, the % character is only interpreted as wildcard IF AND
ONLY IF it is the first/last character of the searched term


LIKE '%escape' --> ENDS WITH 'escape'

If we use % to escape %,
LIKE '%%escape' -->  EQUALS TO '%escape'

LIKE '%%%escape' --> EQUALS TO '%%escape' ???




On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy <
mikhail.krupits...@jetbrains.com> wrote:

> Hi, Jim,
>
> What pattern should be used to search “ends with  ‘%escape’ “ with your
> conception?
>
> Thanks,
> Mikhail
>
> On 22 Sep 2016, at 18:51, Jim Ancona  wrote:
>
> To answer DuyHai's question without introducing new syntax, I'd suggest:
>
> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape'
>
> So the first two %'s are translated to a literal, non-wildcard % and the
> third % is a wildcard because it's not doubled.
>
> Jim
>
> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> I guess that it should be similar to how it is done in SQL for LIKE
>> patterns.
>>
>> You can introduce an escape character, e.g. ‘\’.
>> Examples:
>> ‘%’ - any string
>> ‘\%’ - equal to ‘%’ character
>> ‘\%foo%’ - starts from ‘%foo’
>> ‘%%%escape’ - ends with ’escape’
>> ‘\%%’ - starts from ‘%’
>> ‘\\\%%’ - starts from ‘\%’ .
>>
>> What do you think?
>>
>> Thanks,
>> Mikhail
>>
>> On 22 Sep 2016, at 16:47, DuyHai Doan  wrote:
>>
>> Hello Mikhail
>>
>> It's more complicated than it seems
>>
>> LIKE '%%escape' means  EQUAL TO '%escape'
>>
>> LIKE '%escape' means ENDS WITH 'escape'
>>
>> What about LIKE '%%%escape'?
>>
>> How should we treat this case ? Replace %% by % at the beginning of the
>> searched term ??
>>
>>
>>
>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>>> Hi!
>>>
>>> We’ve talked about two items:
>>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with
>>> help of LIKE.
>>>
>>> Item #1 was resolved as CASSANDRA-12573.
>>>
>>> Regarding item #2: you said the following:
>>>
>>> A possible fix would be:
>>>
>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>>> the column data type)
>>> 2) remove the escape character e.g. before parsing OR use some advanced
>>> regex to exclude the %% from parsing e.g
>>>
>>> Step 2) is dead easy but step 1) is harder because I don't know if
>>> converting the bytebuffer into String at this stage of the CQL parser is
>>> expensive or not (in terms of computation)
>>>
>>> Let me try a patch
>>>
>>> So is there any update on this?
>>>
>>> Thanks,
>>> Mikhail
>>>
>>>
>>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
>>> Hi!
>>>
>>> Have you had a chance to try your patch or solve the issue in an other
>>> way?
>>>
>>> Thanks,
>>> Mikhail
>>>
>>> On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:
>>>
>>> Ok so I've found the source of the issue, it's pretty well hidden
>>> because it is NOT in the SASI source code directly.
>>>
>>> Here is the method where C* determines what kind of LIKE expression
>>> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>>>
>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/
>>> apache/cassandra/cql3/restrictions/SingleColumnRestriction.j
>>> ava#L733-L778
>>>
>>> As you can see, it's pretty simple, maybe too simple. Indeed, they
>>> forget to remove escape character BEFORE doing the matching so if your
>>> search is LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
>>>
>>> A possible fix would be:
>>>
>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>>> the column data type)
>>> 2) remove the escape character e.g. before parsing OR use some advanced
>>> regex to exclude the %% from parsing e.g
>>>
>>> Step 2) is dead easy but step 1) is harder because I don't know if
>>> converting the bytebuffer into String at this stage of the CQL parser is
>>> expensive or not (in term of computation)
>>>
>>> Let me try a patch
>>>
>>>
>>>
>>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan 
>>> wrote:
>>>
 Ok you're right, I get your point

 LIKE '%%esc%' --> startWith('%esc')

 LIKE 'escape%%' -->  = 'escape%'

 What I strongly suspect is that in the source code of SASI, we parse
 the % xxx % expression BEFORE applying escape. That will explain the
 observed behavior. E.g:

 LIKE '%%esc%'  parsed as %xxx% where xxx = %esc

 LIKE 'escape%%' parsed as xxx% where xxx =escape%

 Let me check in the source code and try to reproduce the issue



 On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
 mikhail.krupits...@jetbrains.com> wrote:

> Looks like we have different understanding of what results are
> expected.
> I based my understanding on http://docs.datastax.com/en
> 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-23 Thread Mikhail Krupitskiy
Hi, Jim,

What pattern should be used to search “ends with  ‘%escape’ “ with your 
conception?

Thanks,
Mikhail
> On 22 Sep 2016, at 18:51, Jim Ancona  wrote:
> 
> To answer DuyHai's question without introducing new syntax, I'd suggest:
>> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape' 
> So the first two %'s are translated to a literal, non-wildcard % and the 
> third % is a wildcard because it's not doubled.
> 
> Jim
> 
> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy 
> > 
> wrote:
> I guess that it should be similar to how it is done in SQL for LIKE patterns.
> 
> You can introduce an escape character, e.g. ‘\’.
> Examples:
> ‘%’ - any string
> ‘\%’ - equal to ‘%’ character
> ‘\%foo%’ - starts from ‘%foo’
> ‘%%%escape’ - ends with ’escape’
> ‘\%%’ - starts from ‘%’
> ‘\\\%%’ - starts from ‘\%’ .
> 
> What do you think?
> 
> Thanks,
> Mikhail
>> On 22 Sep 2016, at 16:47, DuyHai Doan > > wrote:
>> 
>> Hello Mikhail
>> 
>> It's more complicated than it seems
>> 
>> LIKE '%%escape' means  EQUAL TO '%escape'
>> 
>> LIKE '%escape' means ENDS WITH 'escape'
>> 
>> What about LIKE '%%%escape'?
>> 
>> How should we treat this case ? Replace %% by % at the beginning of the 
>> searched term ??
>> 
>> 
>> 
>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy 
>> > 
>> wrote:
>> Hi!
>> 
>> We’ve talked about two items:
>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with help 
>> of LIKE.
>> 
>> Item #1 was resolved as CASSANDRA-12573.
>> 
>> Regarding item #2: you said the following:
>>> A possible fix would be:
>>> 
>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on 
>>> the column data type)
>>> 2) remove the escape character e.g. before parsing OR use some advanced 
>>> regex to exclude the %% from parsing e.g
>>> 
>>> Step 2) is dead easy but step 1) is harder because I don't know if 
>>> converting the bytebuffer into String at this stage of the CQL parser is 
>>> expensive or not (in terms of computation)
>>> 
>>> Let me try a patch 
>> 
>> So is there any update on this?
>> 
>> Thanks,
>> Mikhail
>> 
>> 
>>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy 
>>> >> > wrote:
>>> 
>>> Hi!
>>> 
>>> Have you had a chance to try your patch or solve the issue in an other way? 
>>> 
>>> Thanks,
>>> Mikhail
 On 15 Sep 2016, at 16:02, DuyHai Doan > wrote:
 
 Ok so I've found the source of the issue, it's pretty well hidden because 
 it is NOT in the SASI source code directly.
 
 Here is the method where C* determines what kind of LIKE expression you're 
 using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
 
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
  
 
 
 As you can see, it's pretty simple, maybe too simple. Indeed, they forget 
 to remove escape character BEFORE doing the matching so if your search is 
 LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
 
 A possible fix would be:
 
 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on 
 the column data type)
 2) remove the escape character e.g. before parsing OR use some advanced 
 regex to exclude the %% from parsing e.g
 
 Step 2) is dead easy but step 1) is harder because I don't know if 
 converting the bytebuffer into String at this stage of the CQL parser is 
 expensive or not (in term of computation)
 
 Let me try a patch  
 
 
 
 On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan > wrote:
 Ok you're right, I get your point
 
 LIKE '%%esc%' --> startWith('%esc')
 
 LIKE 'escape%%' -->  = 'escape%'
 
 What I strongly suspect is that in the source code of SASI, we parse the % 
 xxx % expression BEFORE applying escape. That will explain the observed 
 behavior. E.g:
 
 LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
 
 LIKE 'escape%%' parsed as xxx% where xxx =escape%
 
 Let me check in the source code and try to reproduce the issue
 
 
 
 On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy 
 > wrote:
 Looks like we have different understanding of what results are expected.
 I based my 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-22 Thread Jim Ancona
To answer DuyHai's question without introducing new syntax, I'd suggest:

LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape'

So the first two %'s are translated to a literal, non-wildcard % and the
third % is a wildcard because it's not doubled.

Jim

On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy <
mikhail.krupits...@jetbrains.com> wrote:

> I guess that it should be similar to how it is done in SQL for LIKE
> patterns.
>
> You can introduce an escape character, e.g. ‘\’.
> Examples:
> ‘%’ - any string
> ‘\%’ - equal to ‘%’ character
> ‘\%foo%’ - starts from ‘%foo’
> ‘%%%escape’ - ends with ’escape’
> ‘\%%’ - starts from ‘%’
> ‘\\\%%’ - starts from ‘\%’ .
>
> What do you think?
>
> Thanks,
> Mikhail
>
> On 22 Sep 2016, at 16:47, DuyHai Doan  wrote:
>
> Hello Mikhail
>
> It's more complicated than it seems
>
> LIKE '%%escape' means  EQUAL TO '%escape'
>
> LIKE '%escape' means ENDS WITH 'escape'
>
> What about LIKE '%%%escape'?
>
> How should we treat this case ? Replace %% by % at the beginning of the
> searched term ??
>
>
>
> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> Hi!
>>
>> We’ve talked about two items:
>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with
>> help of LIKE.
>>
>> Item #1 was resolved as CASSANDRA-12573.
>>
>> Regarding item #2: you said the following:
>>
>> A possible fix would be:
>>
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>> the column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced
>> regex to exclude the %% from parsing e.g
>>
>> Step 2) is dead easy but step 1) is harder because I don't know if
>> converting the bytebuffer into String at this stage of the CQL parser is
>> expensive or not (in terms of computation)
>>
>> Let me try a patch
>>
>> So is there any update on this?
>>
>> Thanks,
>> Mikhail
>>
>>
>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>> Hi!
>>
>> Have you had a chance to try your patch or solve the issue in an other
>> way?
>>
>> Thanks,
>> Mikhail
>>
>> On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:
>>
>> Ok so I've found the source of the issue, it's pretty well hidden because
>> it is NOT in the SASI source code directly.
>>
>> Here is the method where C* determines what kind of LIKE expression
>> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>>
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/
>> apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>>
>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget
>> to remove escape character BEFORE doing the matching so if your search is 
>> LIKE
>> '%%esc%', the detected expression is LIKE_CONTAINS.
>>
>> A possible fix would be:
>>
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>> the column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced
>> regex to exclude the %% from parsing e.g
>>
>> Step 2) is dead easy but step 1) is harder because I don't know if
>> converting the bytebuffer into String at this stage of the CQL parser is
>> expensive or not (in terms of computation)
>>
>> Let me try a patch
>>
>>
>>
>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan 
>> wrote:
>>
>>> Ok you're right, I get your point
>>>
>>> LIKE '%%esc%' --> startWith('%esc')
>>>
>>> LIKE 'escape%%' -->  = 'escape%'
>>>
>>> What I strongly suspect is that in the source code of SASI, we parse the
>>> % xxx % expression BEFORE applying escape. That will explain the observed
>>> behavior. E.g:
>>>
>>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>>
>>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>>>
>>> Let me check in the source code and try to reproduce the issue
>>>
>>>
>>>
>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
 Looks like we have different understanding of what results are expected.
 I based my understanding on http://docs.datastax.com/en
 /cql/3.3/cql/cql_using/useSASIIndex.html
 According to the doc ‘esc’ is a pattern for exact match and I guess
 that there is no semantical difference between two LIKE patterns (both of
 patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.

 SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
 *containing* '%esc' so *%esc*apeme is a possible match and also escape
 *%esc*

 Why ‘containing’? I expect that it should be ’starting’..


 SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
 *starting* with 'escape%' so *escape%*me is a valid result and also
 *escape%*esc

 Why ’starting’? I expect that it should be ‘exact matching’.


Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-22 Thread Mikhail Krupitskiy
I guess that it should be similar to how it is done in SQL for LIKE patterns.

You can introduce an escape character, e.g. ‘\’.
Examples:
‘%’ - any string
‘\%’ - equal to ‘%’ character
‘\%foo%’ - starts from ‘%foo’
‘%%%escape’ - ends with ’escape’
‘\%%’ - starts from ‘%’
‘\\\%%’ - starts from ‘\%’ .

What do you think?

Thanks,
Mikhail
> On 22 Sep 2016, at 16:47, DuyHai Doan  wrote:
> 
> Hello Mikhail
> 
> It's more complicated than it seems
> 
> LIKE '%%escape' means  EQUAL TO '%escape'
> 
> LIKE '%escape' means ENDS WITH 'escape'
> 
> What about LIKE '%%%escape'?
> 
> How should we treat this case ? Replace %% by % at the beginning of the 
> searched term ??
> 
> 
> 
> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy 
> > 
> wrote:
> Hi!
> 
> We’ve talked about two items:
> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with help 
> of LIKE.
> 
> Item #1 was resolved as CASSANDRA-12573.
> 
> Regarding item #2: you said the following:
>> A possible fix would be:
>> 
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on the 
>> column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced 
>> regex to exclude the %% from parsing e.g
>> 
>> Step 2) is dead easy but step 1) is harder because I don't know if 
>> converting the bytebuffer into String at this stage of the CQL parser is 
>> expensive or not (in term of computation)
>> 
>> Let me try a patch 
> 
> So is there any update on this?
> 
> Thanks,
> Mikhail
> 
> 
>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy 
>> <mikhail.krupits...@jetbrains.com> wrote:
>> 
>> Hi!
>> 
>> Have you had a chance to try your patch or solve the issue in another way? 
>> 
>> Thanks,
>> Mikhail
>>> On 15 Sep 2016, at 16:02, DuyHai Doan wrote:
>>> 
>>> Ok so I've found the source of the issue, it's pretty well hidden because 
>>> it is NOT in the SASI source code directly.
>>> 
>>> Here is the method where C* determines what kind of LIKE expression you're 
>>> using (LIKE_PREFIX, LIKE_CONTAINS or LIKE_MATCHES):
>>> 
>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>>> 
>>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget 
>>> to remove the escape character BEFORE doing the matching, so if your search is 
>>> LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
>>> 
>>> A possible fix would be:
>>> 
>>> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on 
>>> the column data type)
>>> 2) remove the escape character before parsing, OR use some more advanced 
>>> regex to exclude the %% from parsing
>>> 
>>> Step 2) is dead easy but step 1) is harder because I don't know if 
>>> converting the bytebuffer into a String at this stage of the CQL parser is 
>>> expensive or not (in terms of computation)
>>> 
>>> Let me try a patch
>>> 
>>> 
>>> 
>>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan >> > wrote:
>>> Ok you're right, I get your point
>>> 
>>> LIKE '%%esc%' --> startWith('%esc')
>>> 
>>> LIKE 'escape%%' -->  = 'escape%'
>>> 
>>> What I strongly suspect is that in the source code of SASI, we parse the 
>>> %xxx% expression BEFORE applying the escape. That would explain the observed 
>>> behavior. E.g.:
>>> 
>>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>> 
>>> LIKE 'escape%%' parsed as xxx% where xxx = escape%
>>> 
>>> Let me check in the source code and try to reproduce the issue
>>> 
>>> 
>>> 
>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy 
>>> <mikhail.krupits...@jetbrains.com> wrote:
>>> Looks like we have different understandings of what results are expected.
>>> I based my understanding on 
>>> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html 
>>> According to the doc ‘esc’ is a pattern for exact match, and I guess that 
>>> there is no semantic difference between the two LIKE patterns 
>>> ‘%%esc’ and ‘esc’ (both should be treated as ‘exact match’).
>>> 
 SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
 containing '%esc' so %escapeme is a possible match and also escape%esc
>>> Why ‘containing’? I expect that it should be ’starting’..
 
 SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results 
 starting with 'escape%' so escape%me is a valid result and also escape%esc
>>> Why ’starting’? I expect 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-22 Thread DuyHai Doan
Hello Mikhail

It's more complicated than it seems

LIKE '%%escape' means  EQUAL TO '%escape'

LIKE '%escape' means ENDS WITH 'escape'

What about LIKE '%%%escape'?

How should we treat this case? Replace %% with % at the beginning of the
searched term?
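
To make the ambiguity concrete, here is a tiny self-contained Java sketch
(invented names, not Cassandra code) applying one possible rule, greedy
left-to-right %% escaping. Under that rule '%%%escape' tokenizes as a literal
'%', then a wildcard, then 'escape', which is neither a pure prefix nor a
pure suffix search; the other reading (wildcard first, then an escaped '%')
would instead mean ends-with '%escape'.

    import java.util.ArrayList;
    import java.util.List;

    public class PercentTokens {
        // Greedy left-to-right tokenizer: "%%" -> literal '%', bare '%' -> wildcard.
        static List<String> tokenize(String pattern) {
            List<String> tokens = new ArrayList<>();
            StringBuilder lit = new StringBuilder();
            for (int i = 0; i < pattern.length(); i++) {
                char c = pattern.charAt(i);
                if (c == '%' && i + 1 < pattern.length() && pattern.charAt(i + 1) == '%') {
                    lit.append('%'); // "%%" escapes to a literal '%'
                    i++;
                } else if (c == '%') {
                    if (lit.length() > 0) { tokens.add("LIT(" + lit + ")"); lit.setLength(0); }
                    tokens.add("WILDCARD");
                } else {
                    lit.append(c);
                }
            }
            if (lit.length() > 0) tokens.add("LIT(" + lit + ")");
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("%%%escape")); // [LIT(%), WILDCARD, LIT(escape)]
            System.out.println(tokenize("%%escape"));  // [LIT(%escape)]
            System.out.println(tokenize("%escape"));   // [WILDCARD, LIT(escape)]
        }
    }
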



On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <
mikhail.krupits...@jetbrains.com> wrote:

> Hi!
>
> We’ve talked about two items:
> 1) ‘%’ as a wildcard in the middle of a LIKE pattern.
> 2) How to escape ‘%’ to be able to find strings containing the ‘%’ char with
> the help of LIKE.
>
> Item #1 was resolved as CASSANDRA-12573.
>
> Regarding item #2, you said the following:
>
> A possible fix would be:
>
> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on
> the column data type)
> 2) remove the escape character before parsing, OR use some more advanced
> regex to exclude the %% from parsing
>
> Step 2) is dead easy but step 1) is harder because I don't know if
> converting the bytebuffer into a String at this stage of the CQL parser is
> expensive or not (in terms of computation)
>
> Let me try a patch
>
> So is there any update on this?
>
> Thanks,
> Mikhail
>
>
> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
> Hi!
>
> Have you had a chance to try your patch or solve the issue in another
> way?
>
> Thanks,
> Mikhail
>
> On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:
>
> Ok so I've found the source of the issue, it's pretty well hidden because
> it is NOT in the SASI source code directly.
>
> Here is the method where C* determines what kind of LIKE expression you're
> using (LIKE_PREFIX, LIKE_CONTAINS or LIKE_MATCHES):
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>
> As you can see, it's pretty simple, maybe too simple. Indeed, they forget
> to remove the escape character BEFORE doing the matching, so if your search is
> LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
>
> A possible fix would be:
>
> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on
> the column data type)
> 2) remove the escape character before parsing, OR use some more advanced
> regex to exclude the %% from parsing
>
> Step 2) is dead easy but step 1) is harder because I don't know if
> converting the bytebuffer into a String at this stage of the CQL parser is
> expensive or not (in terms of computation)
>
> Let me try a patch
>
>
>
> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan  wrote:
>
>> Ok you're right, I get your point
>>
>> LIKE '%%esc%' --> startWith('%esc')
>>
>> LIKE 'escape%%' -->  = 'escape%'
>>
>> What I strongly suspect is that in the source code of SASI, we parse the
>> %xxx% expression BEFORE applying the escape. That would explain the observed
>> behavior. E.g.:
>>
>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>
>> LIKE 'escape%%' parsed as xxx% where xxx = escape%
>>
>> Let me check in the source code and try to reproduce the issue
>>
>>
>>
>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>>> Looks like we have different understandings of what results are expected.
>>> I based my understanding on
>>> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html
>>> According to the doc ‘esc’ is a pattern for exact match, and I guess that
>>> there is no semantic difference between the two LIKE patterns
>>> ‘%%esc’ and ‘esc’ (both should be treated as ‘exact match’).
>>>
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>> *%esc*
>>>
>>> Why ‘containing’? I expect that it should be ’starting’..
>>>
>>>
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>> *escape%*esc
>>>
>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>
>>> Also I expect that LIKE ‘%s%sc%’ will return ‘escape%esc’, but it
>>> returns nothing (CASSANDRA-12573).
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>> Mikhail
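
For the interior-wildcard case quoted just above (LIKE ‘%s%sc%’,
CASSANDRA-12573), one client-side workaround is to fetch candidate rows and
filter locally by translating the LIKE pattern into a java.util.regex
Pattern. A sketch only, assuming %% remains the escape character:

    import java.util.regex.Pattern;

    public class LikeToRegex {
        // Translate a LIKE pattern with %% escaping into a regex:
        // "%%" -> literal '%', bare '%' -> ".*", everything else quoted.
        static Pattern compile(String like) {
            StringBuilder re = new StringBuilder();
            for (int i = 0; i < like.length(); i++) {
                char c = like.charAt(i);
                if (c == '%' && i + 1 < like.length() && like.charAt(i + 1) == '%') {
                    re.append(Pattern.quote("%"));
                    i++;
                } else if (c == '%') {
                    re.append(".*");
                } else {
                    re.append(Pattern.quote(String.valueOf(c)));
                }
            }
            return Pattern.compile(re.toString());
        }

        public static void main(String[] args) {
            // Matches 'escape%esc' even though wildcards sit mid-pattern.
            System.out.println(compile("%s%sc%").matcher("escape%esc").matches()); // true
        }
    }
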
>>>
>>> On 13 Sep 2016, at 19:31, DuyHai Doan  wrote:
>>>
>>> CREATE CUSTOM INDEX ON test.escape(val)
>>> USING 'org.apache.cassandra.index.sasi.SASIIndex'
>>> WITH OPTIONS = {'mode': 'CONTAINS',
>>> 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>>> 'case_sensitive': 'false'};
>>>
>>> I don't see any problem in the results you got
>>>
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>> *%esc*
>>>
>>> Why ‘containing’? I expect that it should be ’starting’..
>>>
>>>
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>> 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-22 Thread Mikhail Krupitskiy
Hi!

We’ve talked about two items:
1) ‘%’ as a wildcard in the middle of a LIKE pattern.
2) How to escape ‘%’ to be able to find strings containing the ‘%’ char with 
the help of LIKE.

Item #1 was resolved as CASSANDRA-12573.

Regarding item #2, you said the following:
> A possible fix would be:
> 
> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on the 
> column data type)
> 2) remove the escape character before parsing, OR use some more advanced regex 
> to exclude the %% from parsing
> 
> Step 2) is dead easy but step 1) is harder because I don't know if converting 
> the bytebuffer into a String at this stage of the CQL parser is expensive or 
> not (in terms of computation)
> 
> Let me try a patch 

So is there any update on this?

Thanks,
Mikhail


> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy 
> <mikhail.krupits...@jetbrains.com> wrote:
> 
> Hi!
> 
> Have you had a chance to try your patch or solve the issue in another way? 
> 
> Thanks,
> Mikhail
>> On 15 Sep 2016, at 16:02, DuyHai Doan wrote:
>> 
>> Ok so I've found the source of the issue, it's pretty well hidden because it 
>> is NOT in the SASI source code directly.
>> 
>> Here is the method where C* determines what kind of LIKE expression you're 
>> using (LIKE_PREFIX, LIKE_CONTAINS or LIKE_MATCHES):
>> 
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>> 
>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget to 
>> remove the escape character BEFORE doing the matching, so if your search is 
>> LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
>> 
>> A possible fix would be:
>> 
>> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on the 
>> column data type)
>> 2) remove the escape character before parsing, OR use some more advanced 
>> regex to exclude the %% from parsing
>> 
>> Step 2) is dead easy but step 1) is harder because I don't know if 
>> converting the bytebuffer into a String at this stage of the CQL parser is 
>> expensive or not (in terms of computation)
>> 
>> Let me try a patch
>> 
>> 
>> 
>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan wrote:
>> Ok you're right, I get your point
>> 
>> LIKE '%%esc%' --> startWith('%esc')
>> 
>> LIKE 'escape%%' -->  = 'escape%'
>> 
>> What I strongly suspect is that in the source code of SASI, we parse the 
>> %xxx% expression BEFORE applying the escape. That would explain the observed 
>> behavior. E.g.:
>> 
>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>> 
>> LIKE 'escape%%' parsed as xxx% where xxx = escape%
>> 
>> Let me check in the source code and try to reproduce the issue
>> 
>> 
>> 
>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy 
>> <mikhail.krupits...@jetbrains.com> wrote:
>> Looks like we have different understandings of what results are expected.
>> I based my understanding on 
>> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html 
>> According to the doc ‘esc’ is a pattern for exact match, and I guess that 
>> there is no semantic difference between the two LIKE patterns 
>> ‘%%esc’ and ‘esc’ (both should be treated as ‘exact match’).
>> 
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>>> containing '%esc' so %escapeme is a possible match and also escape%esc
>> Why ‘containing’? I expect that it should be ’starting’..
>>> 
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results 
>>> starting with 'escape%' so escape%me is a valid result and also escape%esc
>> Why ’starting’? I expect that it should be ‘exact matching’.
>> 
>>> Also I expect that LIKE ‘%s%sc%’ will return ‘escape%esc’, but it returns 
>>> nothing (CASSANDRA-12573).
>> 
>> What am I missing?
>> 
>> Thanks,
>> Mikhail
>> 
>>> On 13 Sep 2016, at 19:31, DuyHai Doan wrote:
>>> 
>>> CREATE CUSTOM INDEX ON test.escape(val)
>>> USING 'org.apache.cassandra.index.sasi.SASIIndex'
>>> WITH OPTIONS = {'mode': 'CONTAINS',
>>> 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>>> 'case_sensitive': 'false'};
>>> 
>>> I don't see any problem in the results you got
>>> 
>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>>> containing '%esc' so %escapeme is a possible match and also escape%esc
>> Why ‘containing’? I expect that it should be ’starting’..
>>> 
>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results 
>>> 

Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-20 Thread Mikhail Krupitskiy
Hi!

Have you had a chance to try your patch or solve the issue in another way? 

Thanks,
Mikhail
> On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:
> 
> Ok so I've found the source of the issue, it's pretty well hidden because it 
> is NOT in the SASI source code directly.
> 
> Here is the method where C* determines what kind of LIKE expression you're 
> using (LIKE_PREFIX, LIKE_CONTAINS or LIKE_MATCHES):
> 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
> 
> As you can see, it's pretty simple, maybe too simple. Indeed, they forget to 
> remove the escape character BEFORE doing the matching, so if your search is 
> LIKE '%%esc%', the detected expression is LIKE_CONTAINS.
> 
> A possible fix would be:
> 
> 1) convert the bytebuffer into a plain String (UTF8 or ASCII, depending on the 
> column data type)
> 2) remove the escape character before parsing, OR use some more advanced regex 
> to exclude the %% from parsing
> 
> Step 2) is dead easy but step 1) is harder because I don't know if converting 
> the bytebuffer into a String at this stage of the CQL parser is expensive or 
> not (in terms of computation)
> 
> Let me try a patch  
> 
> 
> 
> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan wrote:
> Ok you're right, I get your point
> 
> LIKE '%%esc%' --> startWith('%esc')
> 
> LIKE 'escape%%' -->  = 'escape%'
> 
> What I strongly suspect is that in the source code of SASI, we parse the 
> %xxx% expression BEFORE applying the escape. That would explain the observed 
> behavior. E.g.:
> 
> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
> 
> LIKE 'escape%%' parsed as xxx% where xxx = escape%
> 
> Let me check in the source code and try to reproduce the issue
> 
> 
> 
> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy 
> <mikhail.krupits...@jetbrains.com> wrote:
> Looks like we have different understandings of what results are expected.
> I based my understanding on 
> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html 
> According to the doc ‘esc’ is a pattern for exact match, and I guess that 
> there is no semantic difference between the two LIKE patterns 
> ‘%%esc’ and ‘esc’ (both should be treated as ‘exact match’).
> 
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>> containing '%esc' so %escapeme is a possible match and also escape%esc
> Why ‘containing’? I expect that it should be ’starting’..
>> 
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting 
>> with 'escape%' so escape%me is a valid result and also escape%esc
> Why ’starting’? I expect that it should be ‘exact matching’.
> 
> Also I expect that LIKE ‘%s%sc%’ will return ‘escape%esc’, but it returns 
> nothing (CASSANDRA-12573).
> 
> What am I missing?
> 
> Thanks,
> Mikhail
> 
>> On 13 Sep 2016, at 19:31, DuyHai Doan wrote:
>> 
>> CREATE CUSTOM INDEX ON test.escape(val)
>> USING 'org.apache.cassandra.index.sasi.SASIIndex'
>> WITH OPTIONS = {'mode': 'CONTAINS',
>> 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>> 'case_sensitive': 'false'};
>> 
>> I don't see any problem in the results you got
>> 
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>> containing '%esc' so %escapeme is a possible match and also escape%esc
> Why ‘containing’? I expect that it should be ’starting’..
>> 
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting 
>> with 'escape%' so escape%me is a valid result and also escape%esc
> Why ’starting’? I expect that it should be ‘exact matching’.
> 
>> 
>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy 
>> <mikhail.krupits...@jetbrains.com> wrote:
>> Thanks for the reply.
>> Could you please provide the index definition that you used?
>> With the index from my script I get the following results:
>> 
>> cqlsh:test> select * from escape;
>> 
>>  id | val
>> ----+------------
>>   1 | %escapeme
>>   2 | escape%me
>>   3 | escape%esc
>> 
>> Contains search
>> 
>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>> 
>>  id | val
>> ----+------------
>>   1 | %escapeme
>>   3 | escape%esc
>> (2 rows)
>> 
>> 
>> Prefix search
>> 
>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>> 
>>  id | val
>> ----+------------
>>   2 | escape%me
>>   3 | escape%esc
>> 
>> Thanks,
>> Mikhail 
>> 
>>> On 13 Sep 2016, at 18:16, DuyHai Doan 
