Hi Sean, there is no flag in cassandra.yaml to disallow ALLOW FILTERING; the only protection you have comes from your data model. Don't ask Cassandra to scan all the data in a table; the ideal query reads from a single partition.
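To illustrate the single-partition point above, a sketch in CQL (the keyspace, table, and column names here are hypothetical, not from the thread):

```sql
-- Good: the partition key is fully specified, so Cassandra reads one partition.
SELECT * FROM my_keyspace.events
WHERE id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f;

-- Bad: no partition key, so every replica must scan all of its data.
SELECT * FROM my_keyspace.events
WHERE created_at > '2019-03-01' ALLOW FILTERING;
```

The first query touches the replicas for one partition; the second fans out across the whole cluster, which is why it only works with ALLOW FILTERING.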
On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:

> Hi Sean,
>
> For sure, the best approach would be to create another table that serves just that specific query.
>
> How do I set the flag for disallowing ALLOW FILTERING in cassandra.yaml? I read the docs and there seems to be nothing about that.
>
> Regards

On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

>> If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case. Then have a second table that uses a time-bucket approach, as others have suggested:
>>
>> (time bucket, id) as primary key
>>
>> Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.
>>
>> You could include all relevant data in the second table to satisfy your query. Or, if that data seems too large or too volatile to duplicate, just include your primary key and look up the data in the primary table as needed.
>>
>> If you use ALLOW FILTERING, you are setting yourself up for failure to scale. I tell my developers, "if you use allow filtering, you are doing it wrong." In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to disallow filtering at all. The cluster should be able to protect itself from bad queries.
>>
>> *From:* Leena Ghatpande <lghatpa...@hotmail.com>
>> *Sent:* Tuesday, March 12, 2019 9:02 AM
>> *To:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>; user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
>>
>> Our data model cannot be as you recommended below, because the majority of reads need to select the data by the partition key (id) only, not by date.
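As a sketch of the time-bucket table Sean describes above (all names are hypothetical; the bucket granularity would be chosen so a partition stays under roughly 100 MB):

```sql
-- Lookup table bucketed by day: 'day' is the partition key, so a
-- date-range query touches only the partitions for the days queried.
CREATE TABLE my_keyspace.events_by_day (
    day        date,
    created_at timestamp,
    id         uuid,
    PRIMARY KEY ((day), created_at, id)
);

-- Fetch ids created on a given day, optionally narrowed by time of day.
SELECT id FROM my_keyspace.events_by_day
WHERE day = '2019-03-12'
  AND created_at >= '2019-03-12 09:00:00'
  AND created_at <  '2019-03-12 17:00:00';
```

The ids returned can then be used to fetch the full rows from the primary (id-keyed) table, which avoids duplicating large or volatile data into the second table.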
>> You could remodel your data in such a way that you would make the primary key like this:
>>
>> ((date), hour-minute, id)
>>
>> or
>>
>> ((date, hour-minute), id)
>>
>> By adding the date as a clustering column, yes, the idea was to use ALLOW FILTERING on the date and pull the records. We understand that this is not recommended, but we have been doing it on another existing large table and have not run into any issues so far. But we want to understand if there is a better approach.
>>
>> Thanks
>>
>> ------------------------------
>>
>> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
>> *Sent:* Monday, March 11, 2019 7:12 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
>>
>> The query which does not work should be like this (I made a mistake there):
>>
>> cqlsh> SELECT * from my_keyspace.my_table where number > 2;
>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
>>
>> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>
>> Hi Leena,
>>
>> "We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"
>>
>> If you want to select by some time range here, I am wondering how making datetime a clustering column would help you. You still have to provide the partition key, right? E.g.:
>>
>> select * from your_keyspace.your_table where id = 123 and my_date > yesterday and my_date < tomorrow (you get the idea)
>>
>> If you make my_date a clustering column, you cannot do the query below, because you still have to specify the partition key fully, and only then the clustering key (optionally), where you can further order and do ranges. You cannot do a query without specifying the partition key. Well, you can use ALLOW FILTERING, but you do not want to do that at all in your situation, as it would scan everything:
>>
>> select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow
>>
>> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
>> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));
>>
>> cqlsh> SELECT * from my_keyspace.my_table ;
>>
>>  id                                   | number
>> --------------------------------------+--------
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and number > 2;
>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and number > 2;
>>
>>  id                                   | number
>> --------------------------------------+--------
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>
>> You could remodel your data in such a way that you would make the primary key like this:
>>
>> ((date), hour-minute, id)
>>
>> or
>>
>> ((date, hour-minute), id)
>>
>> I would prefer the second one because, if you expect a lot of data per day, they would
>> all end up on the same set of replicas, as the hash of the partition key would be the same for the whole day, so I think you would end up with hotspots. You want to have your data spread more evenly, so the second one seems better to me.
>>
>> You can also investigate how to do this with a materialized view, but I am not sure about the performance there.
>>
>> If you want to copy data, you can do this e.g. with the Cassandra Spark connector: you would just read the table and, as you read it, write to the other one. That is IMHO the fastest and least error-prone approach. You can do it on live production data and just make a "switch" afterwards. Not sure about TTLs, but that should be transparent while copying.
>>
>> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpa...@hotmail.com> wrote:
>>
>> We have a table with over 70M rows with a partition key that is unique. We have a created datetime stamp on each record, and we need to select all rows created in a date range. A secondary index is not an option, as it is high cardinality and could slow performance by doing a full scan on 70M rows.
>>
>> We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?
>>
>> To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this in real time without impacting our customers. COPY is not an option, as we have TTLs on each row of table1 that need to be applied to table2 as well.
>>
>> So what would be the best approach:
>>
>> 1. To be able to select data using a date range without impacting performance. This operation will be needed only on an ad hoc basis, and it won't be frequent.
>> 2. Best way to migrate a large volume of data with TTLs from one table to another within the same cluster.
>>
>> Any other suggestions will also be greatly appreciated.
>>
>> Stefan Miklosovic

--
Best regards

*Dieudonne Madishon NGAYA*
Datastax, Cassandra Architect
*P:* 7048580065
*W:* www.dmnbigdata.com
*E:* dmng...@dmnbigdata.com
*Private E:* dmng...@gmail.com
*A:* Charlotte, NC, 28273, USA
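On point 2 above (migrating rows while preserving their TTLs, which rules out COPY): whatever tool does the copy would need to read each row's remaining TTL and write timestamp and replay them on insert. A CQL sketch of the per-row mechanics, with hypothetical table and column names, and with example values standing in for what the read returns:

```sql
-- Read the payload together with its remaining TTL and original write time.
-- TTL()/WRITETIME() apply to regular (non primary key) columns only.
SELECT id, created_at, TTL(payload), WRITETIME(payload)
FROM my_keyspace.table1;

-- Re-insert into the target table, carrying both values over.
-- The TTL (seconds) and TIMESTAMP (microseconds) below would come
-- from the SELECT above, one pair per row.
INSERT INTO my_keyspace.table2 (day, created_at, id, payload)
VALUES ('2019-03-12', '2019-03-12 09:00:00',
        6e23f79a-8b67-47e0-b8e0-50be78bb1c7f, 'example')
USING TTL 86400 AND TIMESTAMP 1552383600000000;
```

This is what the Spark-connector approach mentioned above has to do under the hood; note that TTLs are tracked per column, so rows whose columns were written at different times may need more than one insert.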