Re: Five Questions for Cassandra Users
1. Do the same people where you work operate the cluster and write the code to develop the application? Different teams: infra and dev are separate.
2. Do you have a metrics stack that allows you to see graphs of various metrics with all the nodes displayed together? We use a third-party APM tool to monitor the cluster.
3. Do you have a log stack that allows you to see the logs for all the nodes together? No. Would like to.
4. Do you regularly repair your clusters - such as by using Reaper? Yes.
5. Do you use artificial intelligence to help manage your clusters? No.

On Thu, 28 Mar 2019, 2:33 PM Kenneth Brotman wrote:

> I'm looking to get a better feel for how people use Cassandra in practice. I thought others would benefit as well, so may I ask you the following five questions:
>
> 1. Do the same people where you work operate the cluster and write the code to develop the application?
>
> 2. Do you have a metrics stack that allows you to see graphs of various metrics with all the nodes displayed together?
>
> 3. Do you have a log stack that allows you to see the logs for all the nodes together?
>
> 4. Do you regularly repair your clusters - such as by using Reaper?
>
> 5. Do you use artificial intelligence to help manage your clusters?
>
> Thank you for taking your time to share this information!
>
> Kenneth Brotman
Re: Tombstone
The partition key is made of a datetime (basically the date truncated to the hour) and a bucket. I think your RCA may be correct: since we are deleting the partition's rows one by one, not in a batch, files may be overlapping for a particular partition. A scheduled thread picks the rows for a partition based on the current datetime and bucket number and checks, for each row, whether the entry is past due; if yes, we trigger an event and remove the entry.

On Tue 19 Jun, 2018, 7:58 PM Jeff Jirsa wrote:

> The most likely explanation is tombstones in files that won't be collected as they potentially overlap data in other files with a lower timestamp (especially true if your partition key doesn't change and you're writing and deleting data within a partition)
>
> --
> Jeff Jirsa
>
> On Jun 19, 2018, at 3:28 AM, Abhishek Singh wrote:
>
>> Hi all,
>> We are using Cassandra for storing events which are time-series based for batch processing. Once a particular batch (based on hour) is processed, we delete the entries, but we were left with almost 18% of deletes marked as tombstones. I ran compaction on the particular CF; the tombstone count didn't come down. Can anyone suggest the optimal tuning/recommended practice for compaction strategy and gc_grace period with 100k entries and deletes every hour?
>>
>> Warm Regards
>> Abhishek Singh
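Given the row-by-row deletion pattern described above, one way to avoid accumulating per-row tombstones is to drop the whole partition once its hour/bucket batch is processed, and to align compaction with the time-bucketed layout. A minimal sketch, assuming a hypothetical table `events` with the `(hour, bucket)` partition key described in the thread (all names here are illustrative, not the poster's actual schema):

```sql
-- Instead of issuing a delete per row as each entry falls due, delete the
-- whole partition once the batch for that hour/bucket is processed:
DELETE FROM events WHERE hour = '2018-06-19 07:00:00' AND bucket = 3;

-- A single partition-level tombstone covers every row in the partition and
-- is far cheaper to carry and compact away than thousands of row tombstones.

-- For hour-bucketed, delete-heavy time series, TimeWindowCompactionStrategy
-- groups SSTables by time window, so whole files can be dropped together:
ALTER TABLE events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': 1}
  AND gc_grace_seconds = 10800;  -- only lower this if repairs reliably
                                 -- complete well within the window
```

Lowering `gc_grace_seconds` trades tombstone retention against the risk of deleted data reappearing if a node misses the delete and is not repaired in time, so the value above is an assumption to adjust per repair schedule.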
Tombstone
Hi all,

We are using Cassandra for storing events which are time-series based for batch processing. Once a particular batch (based on hour) is processed, we delete the entries, but we were left with almost 18% of deletes marked as tombstones. I ran compaction on the particular CF; the tombstone count didn't come down. Can anyone suggest the optimal tuning/recommended practice for compaction strategy and gc_grace period with 100k entries and deletes every hour?

Warm Regards
Abhishek Singh
Avoiding Data Duplication
Hello!

I have a column family to log data coming in from my GPS devices:

    CREATE TABLE log (
        imei ascii,
        date ascii,
        dtime timestamp,
        data ascii,
        stime timestamp,
        PRIMARY KEY ((imei, date), dtime)
    ) WITH CLUSTERING ORDER BY (dtime DESC);

It is the standard schema for modeling time series data, where:

- imei is the unique ID associated with each GPS device
- date is the date taken from dtime
- dtime is the date-time coming from the device
- data is all the latitude, longitude etc. that the device is sending us
- stime is the date-time stamp of the server

The reason I put dtime in the primary key as the clustering column is that most of our queries are done on device time. There can be a delay of a few minutes to a few hours (or a few days, in rare cases) between dtime and stime if the network is not available.

However, now we want to query on server time as well, for the purpose of debugging. These queries will not be as common as queries on device time: say, for every 100 queries on dtime there will be just 1 query on stime. What options do I have?

1. Secondary index - not possible, because stime is a timestamp and CQL does not allow me to put > or < in a query against a secondary index.
2. Data duplication - I can build another column family where I index by stime, but that means I am storing twice as much data. I know everyone says that write operations are cheap and storage is cheap, but how? If I have to buy twice as many machines on AWS EC2, each with their own ephemeral storage, then my bill doubles!

Any other ideas I can try?

Many Thanks,
Abhishek
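One variant of option 2 that blunts the storage objection: the companion table does not have to duplicate the (large) `data` blob; it can store only the keys needed to look the row up in the main table, so the duplicated footprint stays small. A sketch under that assumption (table and column names `log_by_stime`/`sdate` are hypothetical):

```sql
-- Companion "index" table for the rare server-time queries. It duplicates
-- only the keys, not the data payload.
CREATE TABLE log_by_stime (
    sdate ascii,           -- date taken from stime, bounds partition size
    stime timestamp,
    imei  ascii,
    dtime timestamp,
    PRIMARY KEY ((sdate), stime, imei)
) WITH CLUSTERING ORDER BY (stime DESC);

-- Write path: insert into both tables, ideally in a logged batch so the
-- index row and the data row succeed or fail together:
BEGIN BATCH
  INSERT INTO log (imei, date, dtime, data, stime)
    VALUES ('862170011627815', '2015-01-30', '2015-01-30 21:04:00',
            '...payload...', '2015-01-30 21:05:12');
  INSERT INTO log_by_stime (sdate, stime, imei, dtime)
    VALUES ('2015-01-30', '2015-01-30 21:05:12',
            '862170011627815', '2015-01-30 21:04:00');
APPLY BATCH;
```

A debugging query then range-scans `log_by_stime` on `stime` within a day and, if the payload is needed, follows up with point reads against `log` using the returned keys; two round trips, but only for the 1-in-100 case.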
query contains IN on the partition key and an ORDER BY
Hi,

I have run into the following issue, https://issues.apache.org/jira/browse/CASSANDRA-6722, when running a query (containing IN on the partition key and an ORDER BY) using the DataStax driver for Java. However, I am able to run this query fine in cqlsh:

    cqlsh: show version;
    [cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]

    cqlsh:gps> select * from log
               where imeih in ('862170011627815@2015-01-29@03',
                               '862170011627815@2015-01-30@21',
                               '862170011627815@2015-01-30@04')
               and dtime <= '2015-01-30 23:59:59'
               order by dtime desc limit 1;

The same query, when run via the DataStax Java driver, gives the following error:

    Exception in thread "main" com.datastax.driver.core.exceptions.InvalidQueryException: Cannot page queries with both ORDER BY and a IN restriction on the partition key; you must either remove the ORDER BY or the IN and sort client side, or disable paging for this query
        at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)

Any ideas?

Thanks,
Abhishek.
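One workaround the error message itself hints at ("remove the ORDER BY or the IN and sort client side"): issue one single-partition query per key and merge on the client. A sketch, assuming the table clusters on `dtime DESC` as in the earlier schema, so each partition already returns newest-first and the ORDER BY can be dropped:

```sql
-- One query per partition key; no IN, no ORDER BY, so the driver can page
-- each one normally. The client keeps whichever of the three rows has the
-- greatest dtime.
SELECT * FROM log WHERE imeih = '862170011627815@2015-01-29@03'
  AND dtime <= '2015-01-30 23:59:59' LIMIT 1;
SELECT * FROM log WHERE imeih = '862170011627815@2015-01-30@21'
  AND dtime <= '2015-01-30 23:59:59' LIMIT 1;
SELECT * FROM log WHERE imeih = '862170011627815@2015-01-30@04'
  AND dtime <= '2015-01-30 23:59:59' LIMIT 1;
```

The three queries can be run concurrently through the driver's async API, so latency is close to a single query. The other option named by the error, disabling paging for this one statement, also works but forces the coordinator to materialize and sort the full result set.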