Re: scylladb

2017-03-10 Thread Rakesh Kumar
Cassandra vs. Scylla is a valid comparison because the two are compatible:
Scylla is a drop-in replacement for Cassandra.
Is Aerospike a drop-in replacement for Cassandra? If, and only if, it is, then
a comparison with Scylla is valid.



From: Bhuvan Rawal 
To: user@cassandra.apache.org
Sent: Friday, March 10, 2017 11:59 AM
Subject: Re: scylladb

Agreed, C++ gives an added advantage in talking to the underlying hardware with
better efficiency. That sounds good, but can a piece of code written in C++
really deliver 10x the throughput of a Java app? Is a TPC design 10x more
performant than a SEDA architecture?

And if C/C++ is indeed that fast, how can Aerospike (which is itself written in
C) claim to be 10x faster than Scylla here:
http://www.aerospike.com/benchmarks/scylladb-initial/ ? (Combining your
benchmarks and Aerospike's, it would appear that Aerospike is 100x more
performant than C*; I highly doubt that!)

For a moment, let's forget about evaluating two different databases: one can
observe a 10x performance difference between a mistuned Cassandra cluster and
one that's tuned to its data model; there are many tunables in the yaml as well
as in table configs.
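
For illustration, a minimal sketch (Python driver) of the kind of per-table
tunables meant here; the contact point, keyspace, and table names are
placeholders, not anything from the benchmarks above:

from cassandra.cluster import Cluster

# Placeholder cluster/keyspace; 'events' is a hypothetical table.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# Two of the many per-table settings that can swing results on identical
# hardware: compression chunk size (read amplification) and bloom filter
# false-positive chance (disk seeks per read).
session.execute("""
    ALTER TABLE events
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4}
     AND bloom_filter_fp_chance = 0.01
""")
cluster.shutdown()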

The idea is: in order to strengthen your claim, you need to provide complete
system metrics (disk, CPU, network), the point at which the ops increase starts
to decay, and the configs used. Plain ops-per-second and p99 latency figures
are a black box.

Regards,
Bhuvan

On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity <a...@scylladb.com> wrote:
ScyllaDB engineer here.

C++ is really an enabling technology here. It is directly responsible for a 
small fraction of the gain by executing faster than Java.  But it is indirectly 
responsible for the gain by allowing us direct control over memory and 
threading.  Just as an example, Scylla starts by taking over almost all of the 
machine's memory, and dynamically assigning it to memtables, cache, and working 
memory needed to handle requests in flight.  Memory is statically partitioned 
across cores, allowing us to exploit NUMA fully.  You can't do these things in 
Java.

I would say the major contributors to Scylla performance are:
 - thread-per-core design
 - replacement of the page cache with a row cache
 - careful attention to many small details, each contributing a little, but 
with a large overall impact

While I'm here, I'll note that raw performance is not the only goal: we aim for
stable and predictable performance over varying loads and during maintenance
operations like repair, without any special tuning. We measure the amount of
CPU and I/O spent on foreground (user) and background (maintenance) tasks and
divide them fairly. This work is not complete, but it already makes operating
Scylla a lot simpler.


On 03/10/2017 01:42 AM, Kant Kodali wrote:
I don't think ScyllaDB's performance is because of C++. The design decisions in
ScyllaDB are indeed different from Cassandra's, such as getting rid of SEDA and
moving to TPC, and so on.

If someone thinks it is because of C++, then show benchmarks that prove it is
indeed C++ that gave the 10x performance boost ScyllaDB claims, instead of just
stating it.


On Thu, Mar 9, 2017 at 3:22 PM, Richard L. Burton III <mrbur...@gmail.com> wrote:
They spend an enormous amount of time focusing on performance. You can expect 
them to continue on with their optimization and keep crushing it.

P.S., I don't work for ScyllaDB.

On Thu, Mar 9, 2017 at 6:02 PM, Rakesh Kumar <rakeshkumar...@outlook.com> wrote:
In all of their presentations they keep harping on the fact that ScyllaDB is
written in C++ and does not carry the overhead of Java. Still, the difference
looks staggering.
From: daemeon reiydelle <daeme...@gmail.com>
Sent: Thursday, March 9, 2017 14:21
To: user@cassandra.apache.org
Subject: Re: scylladb

The comparison is fair, and conservative. I did substantial performance
comparisons for two clients; both returned throughputs faster than the
published comparisons (15x, as I recall). At that time the client preferred to
utilize a Cassandra COTS solution and use a caching solution for OLA
compliance.


...

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Mar 9, 2017 at 11:04 AM, Robin Verlangen <ro...@us2.nl> wrote:
I was wondering how people feel about the comparison that's made here between
Cassandra and ScyllaDB:
http://www.scylladb.com/technology/ycsb-cassandra-scylla/#results-of-3-scylla-nodes-vs-30-cassandra-nodes

They are claiming a 10x improvement. Is that a fair comparison, or a somewhat
coloured view of a (micro)benchmark in a specific setup? Any pros/cons known?

Re: scylladb

2017-03-09 Thread Rakesh Kumar
In all of their presentations they keep harping on the fact that ScyllaDB is
written in C++ and does not carry the overhead of Java. Still, the difference
looks staggering.

From: daemeon reiydelle 
Sent: Thursday, March 9, 2017 14:21
To: user@cassandra.apache.org
Subject: Re: scylladb

The comparison is fair, and conservative. I did substantial performance
comparisons for two clients; both returned throughputs faster than the
published comparisons (15x, as I recall). At that time the client preferred to
utilize a Cassandra COTS solution and use a caching solution for OLA
compliance.


...

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Thu, Mar 9, 2017 at 11:04 AM, Robin Verlangen <ro...@us2.nl> wrote:
I was wondering how people feel about the comparison that's made here between 
Cassandra and ScyllaDB : 
http://www.scylladb.com/technology/ycsb-cassandra-scylla/#results-of-3-scylla-nodes-vs-30-cassandra-nodes

They are claiming a 10x improvement. Is that a fair comparison, or a somewhat
coloured view of a (micro)benchmark in a specific setup? Any pros/cons known?

Best regards,

Robin Verlangen
Chief Data Architect

Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.

On Wed, Dec 16, 2015 at 11:52 AM, Carlos Rolo <r...@pythian.com> wrote:
No rain at all! I almost had it running last weekend but stopped short of
installing it. Let's see if this one is for real!

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: @cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +351 91 891 81 00 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Wed, Dec 16, 2015 at 12:38 AM, Dani Traphagen <dani.trapha...@datastax.com> wrote:
You'll be the first Carlos.


Had any rain lately? Curious how this went, if so.

On Thu, Nov 12, 2015 at 4:36 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
I just did a Twitter search on scylladb and did not see any tweets about actual 
use, so far.


-- Jack Krupansky

On Wed, Nov 11, 2015 at 10:54 AM, Carlos Alonso <i...@mrcalonso.com> wrote:
Any update about this?

@Carlos Rolo, did you try it? Thoughts?

Carlos Alonso | Software Engineer | @calonso

On 5 November 2015 at 14:07, Carlos Rolo <r...@pythian.com> wrote:
Something to do on an expected rainy weekend. Thanks for the information.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: @cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +351 91 891 81 00 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Thu, Nov 5, 2015 at 12:07 PM, Dani Traphagen <dani.trapha...@datastax.com> wrote:
As of two days ago, they say they've got it @cjrolo.

https://github.com/scylladb/scylla/wiki/RELEASE-Scylla-0.11-Beta


On Thursday, November 5, 2015, Carlos Rolo <r...@pythian.com> wrote:
I will not try it until multi-DC is implemented. More than a month has passed
since I last looked, so it could be in place by now; if so, I may take some
time to test it.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: @cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +351 91 891 81 00 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Thu, Nov 5, 2015 at 9:37 AM, Jon Haddad  wrote:
Nope, no one I know. Let me know if you try it; I'd love to hear your feedback.

> On Nov 5, 2015, at 9:22 AM, tommaso barbugli  wrote:
>
> Hi guys,
>
> did anyone already try Scylladb (yet another fastest NoSQL database in town) 
> and has some thoughts/hands-on experience to share?
>
> Cheers,
> Tommaso




--
Sent from mobile -- apologies for brevity or errors.

--
DANI TRAPHAGEN
Technical Enablement Lead | dani.trapha...@datastax.com

Re: Limit on number of keyspaces/tables

2017-03-05 Thread Rakesh Kumar
> I ask back: what's your intention?

Maybe documenting the limitations of Cassandra to show Oracle is better :-)

On 05.03.2017 at 11:58, Lata Kannan <lata.kan...@oracle.com> wrote:



Re: Which compaction strategy when modeling a dumb set

2017-02-27 Thread Rakesh Kumar
typo: " If yes, it is considered a good practice for Cassandra"
should read as
" If yes, is it considered a good practice for Cassandra ?" 
________
From: Rakesh Kumar 
Sent: Monday, February 27, 2017 10:06
To: user@cassandra.apache.org
Subject: Re: Which compaction strategy when modeling a dumb set

Do you update this table when an event is processed?  If yes, it is considered
a good practice for Cassandra.  I read somewhere that using Cassandra as a
queuing table is an anti-pattern.

From: Vincent Rischmann 
Sent: Friday, February 24, 2017 06:24
To: user@cassandra.apache.org
Subject: Which compaction strategy when modeling a dumb set

Hello,

I'm using a table like this:

   CREATE TABLE myset (id uuid PRIMARY KEY)

which is basically a set I use for deduplication: id is a unique id for an
event. When I process an event I insert its id, and before processing I check
whether it has already been processed.

It works well enough, but I'm wondering which compaction strategy I should use.
I expect maybe 1% or less of events will end up duplicated (thus not generating
an insert), so the workload will probably be 50% writes, 50% reads.

Is LCS a good strategy here, or should I stick with STCS?
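
For what it's worth, a sketch of the dedup flow described above plus the LCS
switch under discussion, using the Python driver; the contact point and
keyspace name are placeholders:

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])           # placeholder contact point
session = cluster.connect('my_keyspace')   # placeholder keyspace

def already_processed(event_id):
    # Point read on the partition key: the "check" half of the workload.
    row = session.execute("SELECT id FROM myset WHERE id = %s", [event_id]).one()
    return row is not None

def mark_processed(event_id):
    # Single-row insert: the "write" half of the workload.
    session.execute("INSERT INTO myset (id) VALUES (%s)", [event_id])

# LCS bounds the number of SSTables a point read must touch, at the cost of
# extra compaction I/O; that trade-off is what makes it a candidate for a
# read-heavy point-lookup table like this one.
session.execute("ALTER TABLE myset WITH compaction = "
                "{'class': 'LeveledCompactionStrategy'}")

eid = uuid.uuid4()
if not already_processed(eid):
    mark_processed(eid)
cluster.shutdown()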


Re: Which compaction strategy when modeling a dumb set

2017-02-27 Thread Rakesh Kumar
Do you update this table when an event is processed?  If yes, it is considered
a good practice for Cassandra.  I read somewhere that using Cassandra as a
queuing table is an anti-pattern.

From: Vincent Rischmann 
Sent: Friday, February 24, 2017 06:24
To: user@cassandra.apache.org
Subject: Which compaction strategy when modeling a dumb set

Hello,

I'm using a table like this:

   CREATE TABLE myset (id uuid PRIMARY KEY)

which is basically a set I use for deduplication: id is a unique id for an
event. When I process an event I insert its id, and before processing I check
whether it has already been processed.

It works well enough, but I'm wondering which compaction strategy I should use.
I expect maybe 1% or less of events will end up duplicated (thus not generating
an insert), so the workload will probably be 50% writes, 50% reads.

Is LCS a good strategy here, or should I stick with STCS?


Cassandra version numbering

2017-02-23 Thread Rakesh Kumar
Is version 3.0.10 the same as 3.10?

Cassandra website mentions this: Cassandra 3.10 Changelog

But in other places 3.0.10 is mentioned.


Re: Partition size

2016-09-09 Thread Rakesh Kumar
On Fri, Sep 9, 2016 at 11:46 AM, Mark Curtis  wrote:
> If your partition sizes are over 100MB iirc then you'll normally see
> warnings in your system.log, this will outline the partition key, at least
> in Cassandra 2.0 and 2.1 as I recall.

Has it improved in C* 3.x? What is considered a good partition size in C* 3.x?


Re: Cassandra: The Definitive Guide, 2nd Edition

2016-08-02 Thread Rakesh Kumar
I have this book through O'Reilly's early release program. It is a must-have
book for those who want to learn how C* works under the hood.


Re: Blog post on Cassandra's inner workings and performance - feedback welcome

2016-07-07 Thread Rakesh Kumar
http://localhost:4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/

This is not a valid address.

On Thu, Jul 7, 2016 at 9:11 AM, Manuel Kiessling  wrote:
> Hi all,
>
> I'm currently in the process of understanding the inner workings of
> Cassandra with regards to network and local storage mechanisms and
> operations. In order to do so, I've written a blog post about it which is
> now in a "first final" version.
>
> Any feedback, especially corrections regarding misunderstandings on my side,
> would be highly appreciated. The post really represents my very subjective
> view on how Cassandra works under the hood, which makes it prone to errors
> of course.
>
> You can access the current version at
> http://localhost:4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/
>
> Thanks,
> --
>  Manuel
>


Re: Backup strategy

2016-06-16 Thread Rakesh Kumar
On Thu, Jun 16, 2016 at 7:30 PM, Bhuvan Rawal  wrote:
> 2. Snapshotting : Hardlinks of sstables will get created. This is a very
> fast process, and the latest data is captured into sstables after flushing
> memtables; snapshots will be created in the snapshots directory. But a
> snapshot does not let you go back to an arbitrary point in time, whereas
> incremental backups do.

Does that mean that the only point-in-time recovery possible is via
incremental backup? In other words, does C* not have a concept of rolling
forward commit logs to a point in time (like RDBMSes do)? Please clarify.
Thanks.


Time series data with only inserts

2016-05-30 Thread Rakesh Kumar
Let us assume that there is a table which gets only inserts and, under normal
circumstances, no reads. If we assume a TTL of 7 days, what event will trigger
a compaction/purge of the old data if it is not in the memtable/cache and no
session needs it?

thanks.
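
Expired data is physically removed only when compaction rewrites the SSTables
containing it (and only after gc_grace_seconds); nothing purges it merely
because it sits unread. A hedged sketch of a table tuned for this insert-only,
TTL'd pattern; all names are placeholders, and TimeWindowCompactionStrategy
requires a release that ships it (DateTieredCompactionStrategy plays the same
role in older ones):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])           # placeholder contact point
session = cluster.connect('my_keyspace')   # placeholder keyspace

# Insert-only time series with a 7-day default TTL; daily compaction
# windows let whole expired SSTables be dropped instead of rewritten.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_events (
        sensor_id text,
        ts        timestamp,
        payload   text,
        PRIMARY KEY ((sensor_id), ts)
    ) WITH default_time_to_live = 604800
      AND compaction = {'class': 'TimeWindowCompactionStrategy',
                        'compaction_window_unit': 'DAYS',
                        'compaction_window_size': 1}
""")
cluster.shutdown()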


Re: Why is write failing

2016-03-28 Thread Rakesh Kumar
> This is in my cassandra-topology.properties

My bad. I used the wrong file (cassandra-topology.properties) instead of the rackdc properties file.


Why is write failing

2016-03-28 Thread Rakesh Kumar
Cassandra: 3.0.3

I am new to Cassandra.

I am creating a test instance of four nodes, two in each data center. The idea
is to verify that Cassandra can continue accepting writes even if one DC is
down and we further lose one machine in the surviving DC.

This is in my cassandra-topology.properties

10.122.66.41=DC1:RAC1
10.122.98.53=DC1:RAC2
10.122.142.218=DC2:RAC1
10.122.142.219=DC2:RAC2

# default for unknown nodes
default=DC2:RAC1

Snitch property in cassandra.yaml
 endpoint_snitch: GossipingPropertyFileSnitch

Keyspace has been defined as follows

CREATE KEYSPACE mytesting
 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 1}
AND durable_writes = true ;

Yet when I insert into a table via cqlsh with no consistency set, I get this error:

Unavailable: code=1000 [Unavailable exception] message="Cannot achieve
consistency level ONE" info={'required_replicas': 1, 'alive_replicas':
0, 'consistency': 'ONE'}

I have verified that Cassandra is up on all four nodes.

What is going on?

Thanks.
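
One hedged way to diagnose this from the driver: GossipingPropertyFileSnitch
reads cassandra-rackdc.properties, not cassandra-topology.properties, so the
DC names the nodes actually advertise may not match the 'DC1'/'DC2' keys used
in the keyspace. A sketch that prints what each node reports:

from cassandra.cluster import Cluster

cluster = Cluster(['10.122.66.41'])   # any reachable node from the list above
session = cluster.connect()

local = session.execute("SELECT data_center, rack FROM system.local").one()
print('local:', local.data_center, local.rack)
for peer in session.execute("SELECT peer, data_center, rack FROM system.peers"):
    print(peer.peer, peer.data_center, peer.rack)

# If these names don't exactly match the 'DC1'/'DC2' keys in the keyspace's
# replication map, NetworkTopologyStrategy places zero replicas there and
# writes fail with "Cannot achieve consistency level ONE".
cluster.shutdown()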


Re: How many nodes do we require

2016-03-25 Thread Rakesh Kumar
On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky
 wrote:
> It depends on how much data you have. A single node can store a lot of data,
> but the more data you have the longer a repair or node replacement will
> take. How long can you tolerate for a full repair or node replacement?

At this time, and for the foreseeable future, the size of the data will not be
significant, so we can safely disregard the above as a decision factor.

>
> Generally, RF=3 is both sufficient and recommended.

Do you mean SimpleStrategy with RF=3, or NetworkTopologyStrategy with RF=3?


taken from:

https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html

"
Three replicas in each data center: This configuration tolerates either the
failure of one node per replication group at a strong consistency level of
LOCAL_QUORUM, or multiple node failures per data center using consistency
level ONE."

In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively mean ALL?

I will state our requirement clearly:

If we are going with six nodes (3 in each DC), we should be able to write even
with the loss of one DC and the loss of one node in the surviving DC. I am open
to hearing what compromise we would have to make on reads while a DC is down;
for us, writes are critical, more than reads.

Maybe this is not possible with 6 nodes and requires more. Please advise.
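
For reference, the quorum arithmetic for the failure mode described above, as
a small sketch (assuming NetworkTopologyStrategy with RF=3 per DC on the six
nodes):

# Quorum size for a replica group.
def quorum(replicas):
    return replicas // 2 + 1

rf_per_dc = 3
local_survivors = rf_per_dc - 1          # one node lost in the surviving DC

# LOCAL_QUORUM counts only the local DC's replicas: it needs 2 of 3, and 2
# survive, so writes at LOCAL_QUORUM still succeed.
print(quorum(rf_per_dc), local_survivors >= quorum(rf_per_dc))           # 2 True

# Plain QUORUM counts all 6 replicas: it needs 4, but only 2 are up.
print(quorum(2 * rf_per_dc), local_survivors >= quorum(2 * rf_per_dc))   # 4 False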


How many nodes do we require

2016-03-25 Thread Rakesh Kumar
We have two data centers. Our requirement is simple:

Assuming that we have an equal number of nodes in each DC, we should be able to
run with the loss of one DC and the loss of at most one node in the surviving
DC. Can this be achieved with 6 nodes (3 in each)? Obviously, for that, all
data must be available on any two nodes. Any pointers on replication factor?

Thanks

--
Sent from mobile. 

Re: Client drivers

2016-03-24 Thread Rakesh Kumar
> Every language has a different means of working with dependencies.  Some are
> compiled in (java, c), some are pulled in via libraries (python).  You'll
> have to be more specific.

I am interested mainly in C++ and Java.

Thanks.


Client drivers

2016-03-24 Thread Rakesh Kumar
Is it possible to install multiple versions of language drivers on the client
machines? This would typically be useful during an upgrade, whereby falling
back to the old version is easy.

thanks.


Apache Cassandra's license terms

2016-03-19 Thread Rakesh Kumar
What type of open-source license does Cassandra use? If we use open-source
Cassandra for a revenue-generating product, are we expected to contribute our
code back to the open-source project?

thanks


Python to type field

2016-03-19 Thread Rakesh Kumar
Hi

I have a type defined as follows

CREATE TYPE etag (
ttype int,
tvalue text
);

And this is used in a col of a table as follows

 evetag list<frozen<etag>>

I have the following value in a file
[{ttype: 3 , tvalue: '90A1'}]

This gets inserted via COPY command with no issues.

However, when I try to insert the same value via a Python program I am
writing, where I prepare and then bind, I get this error on execute:
TypeError: Received an argument of invalid type for column "evetag".
Expected: , Got: ; (Received a string for a type that
expects a sequence)

I tried casting the variable in Python to a list and a tuple, but got the same error.
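
The error says a string was passed where the driver wants a sequence, so the
bound value must be a Python list of UDT objects rather than the literal text
from the file. A sketch, assuming the etag type above; the table name, id
column, keyspace, and contact point are placeholders:

from cassandra.cluster import Cluster

class Etag(object):
    # Attribute names must match the UDT's field names.
    def __init__(self, ttype, tvalue):
        self.ttype = ttype
        self.tvalue = tvalue

cluster = Cluster(['127.0.0.1'])              # placeholder contact point
session = cluster.connect('my_keyspace')      # placeholder keyspace
cluster.register_user_type('my_keyspace', 'etag', Etag)

# Bind a real list of UDT instances, not the string "[{ttype: 3, ...}]".
prepared = session.prepare("INSERT INTO mytable (id, evetag) VALUES (?, ?)")
session.execute(prepared, ('row-1', [Etag(3, '90A1')]))
cluster.shutdown()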


Questions about Datastax support

2016-03-19 Thread Rakesh Kumar
Few questions:

1 - Has there been an announcement as to when DataStax will stop supporting the
    2.x version? I am aware that the community will stop supporting 2.x in
    Nov 2016. What about support for paid DataStax customers; will it go
    beyond November?
2 - Are there any plans by DataStax to start supporting 3.x?
3 - Is version 3.x recommended for production use?
4 - What about compatibility of 3.x with the DataStax development and
    monitoring tools? Currently, OpsCenter does not work with 3.x.

thanks


Re: Questions about Datastax support

2016-03-18 Thread Rakesh Kumar
> 1. They have a published support policy:
> http://www.datastax.com/support-policy/supported-software

Why is the version number so different from the Cassandra community edition?

Take a look at this row:
4.8.2 | Release Notes | Nov 11, 2015 | Mar 23, 2016 | Sep 23, 2017

What is version 4.8.2?


Re:

2016-03-15 Thread Rakesh Kumar
How are you trying to insert? Paste your code here.

Re: What is wrong in this token function

2016-03-10 Thread Rakesh Kumar
thanks. that explains it.



-Original Message-
From: Jack Krupansky 
To: user 
Sent: Thu, Mar 10, 2016 5:28 pm
Subject: Re: What is wrong in this token function



>From the doc: "When using the RandomPartitioner or Murmur3Partitioner, 
>Cassandra rows are ordered by the hash of their value and hence the order of 
>rows is not meaningful... The ByteOrdered partitioner arranges tokens the same 
>way as key values, but the RandomPartitioner and Murmur3Partitioner distribute 
>tokens in a completely unordered manner. The token function makes it possible 
>to page through these unordered partitioner results."


See:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html (for 2.1)

https://docs.datastax.com/en/cql/3.3/cql/cql_using/usePaging.html (for 2.2 and 
3.x)







-- Jack Krupansky



On Thu, Mar 10, 2016 at 5:14 PM, Rakesh Kumar  wrote:

I am using the default Murmur3. So are you saying that, in the case of Murmur3,
the following two queries

select count(*)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;
and
select count(*)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;


are not the same?


And yes, I am aware of how to change the clustering key to get the first query.
This question is more of an academic exercise for me.




-Original Message-
From: Jack Krupansky 
To: user 

Sent: Thu, Mar 10, 2016 4:55 pm
Subject: Re: What is wrong in this token function



What partitioner are you using? The default partitioner is not "ordered", so it 
will randomly order the hashes/tokens, so that tokens will not be ordered even 
if your PKs are ordered. You probably want to use customer as your partition 
key and event time as a clustering column - then you can use RDBMS-like WHERE 
conditions to select a slice of the partition.
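
A sketch of the schema change Jack describes, with event_time moved out of the
partition key and into a clustering column so the range predicate becomes an
ordinary slice; the table name is made up, and the timestamps are spelled out
in full here:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])              # placeholder contact point
session = cluster.connect('my_keyspace')      # placeholder keyspace

# Partition by customer, cluster by time: rows within a partition are
# stored in event_time order, so Query1 works without token().
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_customer (
        customer_id text,
        event_time  timestamp,
        event_id    text,
        PRIMARY KEY ((customer_id), event_time, event_id)
    )
""")

rows = session.execute("""
    SELECT count(*) FROM events_by_customer
    WHERE customer_id = '289'
      AND event_time >= '2016-03-01 18:45:00+0000'
      AND event_time <= '2016-03-12 19:05:00+0000'
""")
print(rows.one())
cluster.shutdown()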



-- Jack Krupansky



On Thu, Mar 10, 2016 at 4:45 PM, Rakesh Kumar  wrote:


typo: the primary key was (customer_id + event_time )



-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Thu, Mar 10, 2016 4:44 pm
Subject: What is wrong in this token function



C*  3.0.3


I have a table table1 which has the primary key on ((customer_id,event_id)).



I loaded 1.03 million rows from a csv file.


Business case: Show me all events for a given customer in a given time frame


In RDBMS it will be


(Query1)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;



But C* does not allow >=/<= on partition key columns; it suggested the token function.


So I did this:


(Query2)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;



I am seeing roughly 65% more rows than there should be: it should be 99K rows,
but it shows 163K.


I checked the output against the csv file itself. To double-check, I loaded the
csv into another table with a modified primary key so that the first query
(Query1) could be executed. It also showed 99K rows.


Am I using the token function incorrectly?

Re: What is wrong in this token function

2016-03-10 Thread Rakesh Kumar
I am using the default Murmur3. So are you saying that, in the case of Murmur3,
the following two queries

select count(*)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;
and
select count(*)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;


are not the same?


And yes, I am aware of how to change the clustering key to get the first query.
This question is more of an academic exercise for me.




-Original Message-
From: Jack Krupansky 
To: user 
Sent: Thu, Mar 10, 2016 4:55 pm
Subject: Re: What is wrong in this token function



What partitioner are you using? The default partitioner is not "ordered", so it 
will randomly order the hashes/tokens, so that tokens will not be ordered even 
if your PKs are ordered. You probably want to use customer as your partition 
key and event time as a clustering column - then you can use RDBMS-like WHERE 
conditions to select a slice of the partition.



-- Jack Krupansky



On Thu, Mar 10, 2016 at 4:45 PM, Rakesh Kumar  wrote:


typo: the primary key was (customer_id + event_time )



-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Thu, Mar 10, 2016 4:44 pm
Subject: What is wrong in this token function



C*  3.0.3


I have a table table1 which has the primary key on ((customer_id,event_id)).



I loaded 1.03 million rows from a csv file.


Business case: Show me all events for a given customer in a given time frame


In RDBMS it will be


(Query1)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;



But C* does not allow >=/<= on partition key columns; it suggested the token function.


So I did this:


(Query2)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;



I am seeing roughly 65% more rows than there should be: it should be 99K rows,
but it shows 163K.


I checked the output against the csv file itself. To double-check, I loaded the
csv into another table with a modified primary key so that the first query
(Query1) could be executed. It also showed 99K rows.


Am I using the token function incorrectly?


Re: What is wrong in this token function

2016-03-10 Thread Rakesh Kumar

typo: the primary key was (customer_id + event_time )


-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Thu, Mar 10, 2016 4:44 pm
Subject: What is wrong in this token function



C*  3.0.3


I have a table table1 which has the primary key on ((customer_id,event_id)).



I loaded 1.03 million rows from a csv file.


Business case: Show me all events for a given customer in a given time frame


In RDBMS it will be


(Query1)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;



But C* does not allow >=/<= on partition key columns; it suggested the token function.


So I did this:


(Query2)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;



I am seeing roughly 65% more rows than there should be: it should be 99K rows,
but it shows 163K.


I checked the output against the csv file itself. To double-check, I loaded the
csv into another table with a modified primary key so that the first query
(Query1) could be executed. It also showed 99K rows.


Am I using the token function incorrectly?


What is wrong in this token function

2016-03-10 Thread Rakesh Kumar

C*  3.0.3


I have a table table1 which has the primary key on ((customer_id,event_id)).



I loaded 1.03 million rows from a csv file.


Business case: Show me all events for a given customer in a given time frame


In RDBMS it will be


(Query1)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+' and event_time <= '2016-03-12 
19:05:00+'   ;



But C* does not allow >=/<= on partition key columns; it suggested the token function.


So I did this:


(Query2)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+')  ;



I am seeing roughly 65% more rows than there should be: it should be 99K rows,
but it shows 163K.


I checked the output against the csv file itself. To double-check, I loaded the
csv into another table with a modified primary key so that the first query
(Query1) could be executed. It also showed 99K rows.


Am I using the token function incorrectly?


Possible bug in Cassandra

2016-03-09 Thread Rakesh Kumar

Cassandra : 3.3
CQLSH  : 5.0.1

If there is a typo in the column name of the copy command, we get this:


copy mytable 
(event_id,event_class_cd,event_ts,receive_ts,event_source_instance,client_id,client_id_type,event_tag,event_udf,client_event_date)
from '/pathtofile.dat'
with DELIMITER = '|' AND NULL = 'NULL' AND DATETIMEFORMAT='%Y-%m-%d 
%H:%M:%S.%f' ;



Starting copy of mytable with columns ['event_id', 'event_class_cd', 
'event_ts', 'receive_ts', 'event_source_instance', 'client_id', 
'client_id_type', 'event_tag', 'event_udf', 'event_client_date'].
  
load_ravitest.cql:5:13 child process(es) died unexpectedly, aborting



The typo was in the name event_client_date; it should have been
client_event_date.




Automatically connect to any up node via cqlsh

2016-03-09 Thread Rakesh Kumar

Cassandra : 3.3
CQLSH  : 5.0.1


Is it possible to set up Cassandra/cqlsh so that if a node is down, cqlsh will
automatically try to connect to the other surviving nodes instead of erroring
out? I know it is possible to supply the IP address and port of an up node as
arguments to cqlsh, but I am looking for automatic detection.


thanks
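
cqlsh itself connects to the single host it is given, but the Python driver it
is built on accepts several contact points, tries them in turn, and then
discovers the rest of the cluster; a sketch with placeholder addresses:

from cassandra.cluster import Cluster

# Any one reachable contact point is enough to bootstrap; the rest of the
# ring is discovered automatically and used for failover thereafter.
cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()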



Re: Querying on index

2016-03-01 Thread Rakesh Kumar
Looks like the bloom filter size was the issue. Once I disabled it, the query
returned rows correctly, but it was terribly slow (expected, since it hits the
SSTables every time).



-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:57 pm
Subject: Re: Querying on index




At this time no one else is using this table. So the data is static.

-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:54 pm
Subject: Querying on index

Cassandra: 3.3. On my test system I create a table:
create table eventinput (event_id varchar, event_class_cd int, event_ts timestamp, client_id varchar, event_message text, primary key ((client_id,event_id),event_ts))
I created an index on client_id: create index idx1 on eventinput(client_id);
When I query: select * from eventinput where client_id = 'aa' ALLOW filtering;
I get random results. One time it is 200, another time 400, 500, or 600, and sometimes 0. Why?




Re: Querying on index

2016-03-01 Thread Rakesh Kumar


At this time no one else is using this table. So the data is static.

-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:54 pm
Subject: Querying on index

Cassandra: 3.3. On my test system I create a table:
create table eventinput (event_id varchar, event_class_cd int, event_ts timestamp, client_id varchar, event_message text, primary key ((client_id,event_id),event_ts))
I created an index on client_id: create index idx1 on eventinput(client_id);
When I query: select * from eventinput where client_id = 'aa' ALLOW filtering;
I get random results. One time it is 200, another time 400, 500, or 600, and sometimes 0. Why?



Querying on index

2016-03-01 Thread Rakesh Kumar
Cassandra: 3.3

On my test system I create a table

create table eventinput
(
event_id varchar ,
event_class_cd int ,
event_ts timestamp ,
client_id varchar ,
event_message  text ,
primary key ((client_id,event_id),event_ts)
) 

I created an index on client_id 
create index idx1 on eventinput(client_id);

When I query 
select *
from eventinput
where client_id = 'aa' 
ALLOW filtering ;

I get random results: one time it is 200, another time 400, 500, or 600, and
sometimes 0.

Why?
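
As an aside, with PRIMARY KEY ((client_id, event_id), event_ts) a direct read
must supply both partition-key components; a sketch of the lookup that needs
neither the index nor ALLOW FILTERING (the contact point, keyspace, and
event_id value are placeholders):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])              # placeholder contact point
session = cluster.connect('my_keyspace')      # placeholder keyspace

# Full partition key -> single-partition read, no index, no filtering.
rows = session.execute(
    "SELECT * FROM eventinput WHERE client_id = %s AND event_id = %s",
    ('aa', 'some-event-id'))
for row in rows:
    print(row)
cluster.shutdown()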





Disable writing to debug.log

2016-03-01 Thread Rakesh Kumar
Version: Cassandra 3.3


Can anyone tell me how to disable writing to debug.log?


thanks.


CRT

2016-02-23 Thread Rakesh Kumar

https://www.aphyr.com/posts/294-jepsen-cassandra


How much of this is still valid in version 3.0? The above seems to have been
written for version 1.0.


thanks.