RE: range queries on partition key supported?
Thank you, Kurt. Just one more clarification.

>> And, then entire partition on each node will be searched based on the
>> clustering key (i.e. "time" in this case).
>
> No. It will skip to the section of the partition with time = '12:00'.
> Cassandra should be smart enough to avoid reading the whole partition.

Yeah, that seems correct. I probably didn't phrase it correctly.

Now let's assume a specific node is selected based on the token range, and we need to look up the data with time = '12:00' within a partition that falls within that token range. On this node, there may be more than one partition (let's take two partitions as an example) that qualifies for this token range. In that case, both partitions will need to be looked up to get the data with the given time = '12:00'. So I'm wondering how these two partitions will be looked up on this node. What would the request query look like on this node to get these partitions?

Does that make sense? Do you think I'm missing something?

Thanks,
Preetika

-----Original Message-----
From: kurt greaves [mailto:k...@instaclustr.com]
Sent: Wednesday, January 31, 2018 9:46 PM
To: dev@cassandra.apache.org
Subject: Re: range queries on partition key supported?

> So that means more than one nodes can be selected to fulfill a range
> query based on the token, correct?

Yes. When doing a token range query Cassandra will need to send requests to any node that owns part of the token range requested. This could be just one set of replicas or more, depending on how your token ring is arranged. You could avoid querying multiple nodes by limiting the token() calls to be within one token range.

> And, then entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).

No. It will skip to the section of the partition with time = '12:00'. Cassandra should be smart enough to avoid reading the whole partition.
On 31 January 2018 at 06:57, Tyagi, Preetika wrote:

> So that means more than one nodes can be selected to fulfill a range
> query based on the token, correct?
>
> I was looking at this link:
> https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
>
> In the example query,
>
> SELECT * FROM numberOfRequests
> WHERE token(cluster, date) > token('cluster1', '2015-06-03')
> AND token(cluster, date) <= token('cluster1', '2015-06-05')
> AND time = '12:00'
>
> More than one nodes might get picked for this token based range query.
> And, then entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).
> Is my understanding correct?
>
> Thanks,
> Preetika
>
> -----Original Message-----
> From: J. D. Jordan [mailto:jeremiah.jor...@gmail.com]
> Sent: Tuesday, January 30, 2018 10:13 AM
> To: dev@cassandra.apache.org
> Subject: Re: range queries on partition key supported?
>
> A range query can be performed on the token of a partition key, not on
> the value.
>
> -Jeremiah
>
> > On Jan 30, 2018, at 12:21 PM, Tyagi, Preetika wrote:
> >
> > Hi All,
> >
> > I have a quick question on Cassandra's behavior in case of partition
> > keys. I know that range queries are allowed in general, however, is it
> > also allowed on partition keys as well? The partition key is used as
> > an input to determine a node in a cluster, so I'm wondering how one
> > can possibly perform range query on that.
> >
> > Thanks,
> > Preetika
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
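One way to picture the per-node part of the read discussed in this thread is the toy model below. It is only a sketch, not Cassandra's actual read path: the node's data is assumed to be a list of partitions ordered by token, with each partition's rows ordered by the clustering key. The tokens, the `NODE_DATA` contents, and the `scan` helper are all made up for illustration. The node walks only the partitions whose token falls in the requested range, and inside each one it seeks directly to the requested clustering key instead of reading the whole partition.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data held by one node: (token, partition key, rows).
# Partitions are ordered by token; rows are ordered by clustering key ("time").
NODE_DATA = [
    (10, ("cluster1", "2015-06-04"), [("11:00", 3), ("12:00", 7), ("13:00", 1)]),
    (25, ("cluster2", "2015-06-04"), [("12:00", 5), ("12:30", 2)]),
    (90, ("cluster1", "2015-06-05"), [("09:00", 4)]),
]


def scan(node_data, start, end, time):
    """Return (partition key, time, value) for every row with the given
    clustering key in partitions whose token lies in (start, end]."""
    tokens = [token for token, _, _ in node_data]
    lo = bisect_right(tokens, start)  # first partition with token > start
    hi = bisect_right(tokens, end)    # one past the last partition with token <= end
    hits = []
    for _, key, rows in node_data[lo:hi]:
        # (time,) sorts just before any (time, value) row, so this lands on
        # the first row whose clustering key is >= time.
        i = bisect_left(rows, (time,))
        while i < len(rows) and rows[i][0] == time:
            hits.append((key, time, rows[i][1]))
            i += 1
    return hits
```

In this model, two partitions that both fall in the token range are simply visited one after the other in token order, each with its own seek to time = '12:00'.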
RE: create branch in my github account
Thank you Michael. I was able to create the branch and push my changes! :)

Preetika

-----Original Message-----
From: Michael Shuler [mailto:mshu...@pbandjelly.org] On Behalf Of Michael Shuler
Sent: Tuesday, January 30, 2018 2:04 PM
To: dev@cassandra.apache.org
Subject: Re: create branch in my github account

On 01/30/2018 03:47 PM, Tyagi, Preetika wrote:
> Hi all,
>
> I'm working on the JIRA ticket CASSANDRA-13981 and pushed a patch
> yesterday, however, I have been suggested to create a branch in my
> github account and then push all changes into that. The patch is too
> big hence this seems to be a better approach. I haven't done it before
> so wanted to ensure I do it correctly without messing things up :)
>
> 1. On Cassandra GitHub: https://github.com/apache/cassandra,
>    click on "Fork" to create my own copy in my account.
>
> 2. Git clone on the forked branch above

s/branch/repository/ - this is a new forked repo, not a branch

> 3. Git checkout

  git checkout trunk
  # since 13981 appears to be for 4.0 (trunk)
  # if you worked off some random sha, you may need to rebase on
  # trunk HEAD, otherwise it may not cleanly merge and that will be
  # the first patch review request.
  git checkout -b CASSANDRA-13981  # create a new branch

> 4. Apply my patch
>
> 5. Git commit -m ""
>
> 6. Git push origin trunk

  git push origin CASSANDRA-13981  # push a new branch to your fork

> Please let me know if you notice any issues. Thanks for your help!

You could do this in your fork on the trunk repository, but it's probably better to create a new branch, so you can fetch changes from the upstream trunk branch and rebase your branch, if that is needed.

It is very common to have a number of remotes configured in your local repository: one for your fork, one for the apache upstream, ones for other users' forks, etc. If you do your work directly in your trunk branch, you'll have conflicts when pulling in new commits from apache/cassandra trunk, for example.
--
Michael
Re: range queries on partition key supported?
> So that means more than one nodes can be selected to fulfill a range query
> based on the token, correct?

Yes. When doing a token range query Cassandra will need to send requests to any node that owns part of the token range requested. This could be just one set of replicas or more, depending on how your token ring is arranged. You could avoid querying multiple nodes by limiting the token() calls to be within one token range.

> And, then entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).

No. It will skip to the section of the partition with time = '12:00'. Cassandra should be smart enough to avoid reading the whole partition.

On 31 January 2018 at 06:57, Tyagi, Preetika wrote:

> So that means more than one nodes can be selected to fulfill a range query
> based on the token, correct?
>
> I was looking at this link:
> https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
>
> In the example query,
>
> SELECT * FROM numberOfRequests
> WHERE token(cluster, date) > token('cluster1', '2015-06-03')
> AND token(cluster, date) <= token('cluster1', '2015-06-05')
> AND time = '12:00'
>
> More than one nodes might get picked for this token based range query.
> And, then entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).
> Is my understanding correct?
>
> Thanks,
> Preetika
>
> -----Original Message-----
> From: J. D. Jordan [mailto:jeremiah.jor...@gmail.com]
> Sent: Tuesday, January 30, 2018 10:13 AM
> To: dev@cassandra.apache.org
> Subject: Re: range queries on partition key supported?
>
> A range query can be performed on the token of a partition key, not on the
> value.
>
> -Jeremiah
>
> > On Jan 30, 2018, at 12:21 PM, Tyagi, Preetika wrote:
> >
> > Hi All,
> >
> > I have a quick question on Cassandra's behavior in case of partition
> > keys. I know that range queries are allowed in general, however, is it
> > also allowed on partition keys as well? The partition key is used as an
> > input to determine a node in a cluster, so I'm wondering how one can
> > possibly perform range query on that.
> >
> > Thanks,
> > Preetika
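The point about requests going to every node that owns part of the requested token range can be sketched with a toy ring. This is a simplified model under stated assumptions, not Cassandra's implementation: the four-node `RING`, its token values, and the helpers `owner` and `owners_for_range` are hypothetical, each node is assumed to own the range (previous token, its own token], and replication is ignored.

```python
from bisect import bisect_left, bisect_right

# Hypothetical ring: (token, node). Each node owns (previous_token, token];
# tokens past the last entry wrap around to the first node.
RING = [(-50, "node_a"), (0, "node_b"), (50, "node_c"), (100, "node_d")]
TOKENS = [token for token, _ in RING]


def owner(token):
    """Node owning a single token: the first node whose token is >= it."""
    return RING[bisect_left(TOKENS, token) % len(RING)][1]


def owners_for_range(start, end):
    """Nodes owning any part of the non-wrapping token range (start, end]."""
    if start >= end:
        return []
    first = bisect_right(TOKENS, start)  # owner of the first token after start
    last = bisect_left(TOKENS, end)      # owner of end itself
    nodes = [RING[i % len(RING)][1] for i in range(first, last + 1)]
    return list(dict.fromkeys(nodes))    # drop a repeated wrap-around owner
```

With this ring, `owners_for_range(10, 40)` stays inside one node's range and touches only `node_c`, which is the effect of limiting the token() bounds to a single token range, while `owners_for_range(-10, 60)` spans three owners.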
Cassandra Monthly Dev Roundup: Jan 2018 Edition
Happy 2018 Cassandra Developers,

I hope you all had a good holiday season. In going through some of the tickets/emails, I'm pretty happy - we had some contributions from some big and interesting companies I didn't even realize were using Cassandra, and that's always fun to see [1].

If you haven't had time to keep up with hot issues this month, there's a few hot topics that will cause us to issue a release in the very near future:

1) https://issues.apache.org/jira/browse/CASSANDRA-14092 - We store TTLs as 32 bit ints, we cap users at 20 year TTLs. If you set a TTL to 20 years, that started to overflow the 32 bit int not long ago. That's bad. Different versions have different impact, from annoying to very bad. We'll probably cut a release as soon as this is done. There's some active conversation in the list and on that JIRA - you should read it if you care about how we handle data when we find a negative timestamp on disk (read: there's some disagreement, if you have an opinion, chime in).

2) https://issues.apache.org/jira/browse/CASSANDRA-14173 - The JMX auth stuff used some JDK internals. Those JDK internals changed with JDK8u161. Sam has a new patch, ready to commit. This probably will get more and more attention as more and more people upgrade to the newest JDK and find out Cassandra doesn't start.

In terms of big / interesting commits that landed since the last email:

CASSANDRA-7544 Configurable storage port per node. Huge patch, you probably care about this if you ever tried to run multiple instances of cassandra on one IP (like on a laptop), or on different ports in a given cluster (port 7000 on some hosts, and 7001 on others), or similar.

CASSANDRA-14134 upgraded dtests to python3, getting rid of old dependencies on pycassa (unmaintained), an ancient version of thrift, etc. Another huge patch, if you're developing locally and running dtests yourself, you now need python3. Some extra good news - docs are now much improved.

CASSANDRA-14190 is a patch from a new contributor that did something most operators probably really wish existed 10 years ago - "nodetool reloadseeds". Really should have existed long ago.

CASSANDRA-9067 speed up bloom filter serialization by 3-7x

CASSANDRA-13867 isn't flashy, but is another step in making more things immutable for safety - huge patch for PartitionUpdate and Mutation, for those of you who pay attention to the deep, dark internals.

On the mailing list, a user asked about plans for CDC. If you have an opinion, it's not too late to chime in:
https://lists.apache.org/thread.html/aaa82c7dab534c3a35cfd1c4a082cb3a8f6bbf97e3efe960fa2342d0@%3Cdev.cassandra.apache.org%3E

Patches that could use reviews:

- https://issues.apache.org/jira/browse/CASSANDRA-14205 (Missing CQL reserved keywords)
- https://issues.apache.org/jira/browse/CASSANDRA-14201 (new options to nodetool verify)
- https://issues.apache.org/jira/browse/CASSANDRA-14204 (nodetool garbagecollect assertion error)
- https://issues.apache.org/jira/browse/CASSANDRA-13981 (changes for running on systems with persistent memory)
- https://issues.apache.org/jira/browse/CASSANDRA-14197 (more automatic upgradesstables)
- https://issues.apache.org/jira/browse/CASSANDRA-14176 (2 line python fix for making COPY work)
- https://issues.apache.org/jira/browse/CASSANDRA-14102 (transparent data encryption)
- https://issues.apache.org/jira/browse/CASSANDRA-14107 (key rotation for transparent data encryption)
- https://issues.apache.org/jira/browse/CASSANDRA-14160 (speeding up compaction by keeping overlapping sstables ordered by time)
- https://issues.apache.org/jira/browse/CASSANDRA-12763 (make compaction much faster for cases with lots of sstables)
- https://issues.apache.org/jira/browse/CASSANDRA-14126 (fixing javascript UDFs)
- https://issues.apache.org/jira/browse/CASSANDRA-14070 (exposing primary key column values in a different way)

I'd like to pretend that that's all the patch-available-needing-review tickets, but I'd be lying - there's a LOT of patches waiting for reviews. If you're able, please review a ticket this week. I'll personally buy you a drink next time I bump into you if you do it and remind me about it.

Until February,
- Jeff

Footnote 1: I'm super tempted to name them, but I know some companies don't like the attention, and I don't want everyone to feel like they have to post with personal emails.
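The arithmetic behind the CASSANDRA-14092 overflow mentioned in the roundup can be checked with a few lines. This is a sketch under two assumptions that are not spelled out in the email: that the value that overflows is the expiration time (write time plus TTL) held as signed 32-bit seconds since the Unix epoch, and that the 20-year cap is counted as 20 * 365 days. The function names are made up for illustration.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1                      # largest signed 32-bit value
MAX_TTL_SECONDS = 20 * 365 * 24 * 60 * 60  # assumed 20-year cap, in seconds


def expiration_overflows(now_epoch_seconds, ttl_seconds=MAX_TTL_SECONDS):
    """True if write time + TTL no longer fits in a signed 32-bit int."""
    return now_epoch_seconds + ttl_seconds > INT32_MAX


def first_overflow_instant(ttl_seconds=MAX_TTL_SECONDS):
    """Earliest write time (UTC) at which the given TTL overflows."""
    return datetime.fromtimestamp(INT32_MAX - ttl_seconds + 1, tz=timezone.utc)
```

Under those assumptions, `first_overflow_instant()` lands in late January 2018, which lines up with "started to overflow the 32 bit int not long ago"; shorter TTLs push that date correspondingly later.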
Re: CDC usability and future development
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.

Philosophically, my first pass at the feature prioritized minimizing impact to node performance first and usability second, punting a lot of the de-duplication and RbW implications of having full column values, or materializing stuff off-heap for consumption from a user and flagging as persisted to disk etc, for future work on the feature. I don't personally have any time to devote to moving the feature forward now but as Jeff indicates, Jay and Simon are both active in the space and taking up the torch.

On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa wrote:

> Here's a deck of some proposed additions, discussed at one of the NGCC
> sessions last fall:
>
> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>
> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme wrote:
>
> > Hi all,
> >
> > We are currently designing a system that allows our Cassandra clusters to
> > produce a stream of data updates. Naturally, we have been evaluating if
> > CDC can aid in this endeavor. We have found several challenges in using
> > CDC for this purpose.
> >
> > CDC provides only the mutation as opposed to the full column value, which
> > tends to be of limited use for us. Applications might want to know the
> > full column value, without having to issue a read back. We also see value
> > in being able to publish the full column value both before and after the
> > update. This is especially true when deleting a column since this stream
> > may be joined with others, or consumers may require other fields to
> > properly process the delete.
> >
> > Additionally, there is some difficulty with processing CDC itself such as:
> > - Updates not being immediately available (addressed by CASSANDRA-12148)
> > - Each node providing an independent streams of updates that must be
> >   unified and deduplicated
> >
> > Our question is, what is the vision for CDC development? The current
> > implementation could work for some use cases, but is a ways from a
> > general streaming solution. I understand that the nature of Cassandra
> > makes this quite complicated, but are there any thoughts or desires on
> > the future direction of CDC?
> >
> > Thanks
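The "unified and deduplicated" step for per-node update streams can be pictured roughly as below. This is a minimal sketch and not part of the CDC feature: the update shape `(write_timestamp, partition_key, column, value)`, the `unify` helper, and the assumption that each node's stream is already ordered by write timestamp are all hypothetical, and a real consumer would need to bound the `seen` set and cope with out-of-order delivery.

```python
import heapq


def unify(streams):
    """Merge per-node update streams and drop copies written to several replicas.

    Each stream is an iterable of (write_timestamp, partition_key, column, value)
    tuples, assumed sorted by write_timestamp. Identical tuples coming from
    different replicas are treated as the same update.
    """
    seen = set()
    for update in heapq.merge(*streams, key=lambda u: u[0]):
        if update not in seen:
            seen.add(update)
            yield update
```

For example, if two replicas both report `(1, "k1", "c", "a")`, the merged stream carries it once, followed by the remaining updates in timestamp order.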