Hi Sean, there is no flag in cassandra.yaml to disallow ALLOW FILTERING; the only protection you have comes from your data model. Don't ask Cassandra to scan all the data in a table; the ideal query reads from a single partition.
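To illustrate the single-partition point above, a sketch in CQL (the keyspace, table, and column names here are hypothetical, not from the thread):

```sql
-- Good: the partition key is fully specified, so Cassandra reads one partition.
SELECT * FROM my_keyspace.events
WHERE id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f;

-- Bad: no partition key, so every replica must scan all of its data.
SELECT * FROM my_keyspace.events
WHERE created_at > '2019-03-01' ALLOW FILTERING;
```

The first query touches the replicas for one partition; the second fans out across the whole cluster, which is why it only works with ALLOW FILTERING.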
On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:

> Hi Sean,
>
> For sure, the best approach would be to create another table that serves just that specific query.
>
> How do I set the flag for disallowing ALLOW FILTERING in cassandra.yaml? I read the docs and there seems to be nothing about that.
>
> Regards

On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

>> If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case. Then have a second table that uses a time-bucket approach, as others have suggested:
>>
>> (time bucket, id) as primary key
>>
>> Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.
>>
>> You could include all relevant data in the second table to satisfy your query. Or, if that data seems too large or too volatile to duplicate, just include your primary key and look up the data in the primary table as needed.
>>
>> If you use ALLOW FILTERING, you are setting yourself up for failure to scale. I tell my developers, "if you use allow filtering, you are doing it wrong." In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to disallow filtering at all. The cluster should be able to protect itself from bad queries.
>>
>> *From:* Leena Ghatpande <lghatpa...@hotmail.com>
>> *Sent:* Tuesday, March 12, 2019 9:02 AM
>> *To:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>; user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
>>
>> Our data model cannot be as you recommended below, because the majority of reads need to select the data by the partition key (id) only, not by date.
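As a sketch of the time-bucket table Sean describes above (all names are hypothetical; the bucket granularity would be chosen so a partition stays under roughly 100 MB):

```sql
-- Lookup table bucketed by day: 'day' is the partition key, so a
-- date-range query touches only the partitions for the days queried.
CREATE TABLE my_keyspace.events_by_day (
    day        date,
    created_at timestamp,
    id         uuid,
    PRIMARY KEY ((day), created_at, id)
);

-- Fetch ids created on a given day, optionally narrowed by time of day.
SELECT id FROM my_keyspace.events_by_day
WHERE day = '2019-03-12'
  AND created_at >= '2019-03-12 09:00:00'
  AND created_at <  '2019-03-12 17:00:00';
```

The ids returned can then be used to fetch the full rows from the primary (id-keyed) table, which avoids duplicating large or volatile data into the second table.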
>> You could remodel your data in such a way that you would make the primary key like this:
>>
>> ((date), hour-minute, id)
>>
>> or
>>
>> ((date, hour-minute), id)
>>
>> By adding the date as a clustering column, yes, the idea was to use ALLOW FILTERING on the date and pull the records. We understand that this is not recommended, but we have been doing it on another existing large table and have not run into any issues so far. But we want to understand if there is a better approach.
>>
>> Thanks
>>
>> ------------------------------
>>
>> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
>> *Sent:* Monday, March 11, 2019 7:12 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.
>>
>> The query which does not work should be like this (I made a mistake there):
>>
>> cqlsh> SELECT * from my_keyspace.my_table where number > 2;
>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
>>
>> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>
>> Hi Leena,
>>
>> "We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"
>>
>> If you want to select by some time range here, I am wondering how making datetime a clustering column would help you. You still have to provide the partition key, right? E.g.:
>>
>> select * from your_keyspace.your_table where id = 123 and my_date > yesterday and my_date < tomorrow (you get the idea)
>>
>> If you make my_date a clustering column, you cannot do the query below, because you still have to specify the partition key fully, and only then the clustering key (optionally), where you can further order and do ranges. You cannot do a query without specifying the partition key. Well, you can use ALLOW FILTERING, but you do not want to do that at all in your situation, as it would scan everything:
>>
>> select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow
>>
>> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
>> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));
>>
>> cqlsh> SELECT * from my_keyspace.my_table ;
>>
>>  id                                   | number
>> --------------------------------------+--------
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and number > 2;
>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and number > 2;
>>
>>  id                                   | number
>> --------------------------------------+--------
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>
>> You could remodel your data in such a way that you would make the primary key like this:
>>
>> ((date), hour-minute, id)
>>
>> or
>>
>> ((date, hour-minute), id)
>>
>> I would prefer the second one because, if you expect a lot of data per day, they would
>> all end up on the same set of replicas, as the hash of the partition key would be the same for the whole day, so I think you would end up with hotspots. You want to have your data spread more evenly, so the second one seems better to me.
>>
>> You can also investigate how to do this with a materialized view, but I am not sure about the performance there.
>>
>> If you want to copy data, you can do this e.g. with the Cassandra Spark connector: you would just read the table and, as you read it, write to the other one. That is IMHO the fastest and least error-prone approach. You can do it on live production data and just make a "switch" afterwards. Not sure about TTLs, but that should be transparent while copying.
>>
>> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpa...@hotmail.com> wrote:
>>
>> We have a table with over 70M rows with a partition key that is unique. We have a created datetime stamp on each record, and we need to select all rows created in a date range. A secondary index is not an option, as it is high cardinality and could slow performance by doing a full scan on 70M rows.
>>
>> We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?
>>
>> To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this in real time without impacting our customers. COPY is not an option, as we have TTLs on each row of table1 that need to be applied to table2 as well.
>>
>> So what would be the best approach:
>>
>> 1. To be able to select data using a date range without impacting performance. This operation will be needed only on an ad hoc basis, and it won't be frequent.
>> 2. Best way to migrate a large volume of data with TTLs from one table to another within the same cluster.
>>
>> Any other suggestions will also be greatly appreciated.
>>
>> Stefan Miklosovic

--
Best regards

*Dieudonne Madishon NGAYA*
Datastax, Cassandra Architect
*P:* 7048580065
*W:* www.dmnbigdata.com
*E:* dmng...@dmnbigdata.com
*Private E:* dmng...@gmail.com
*A:* Charlotte, NC, 28273, USA
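On point 2 above (migrating rows while preserving their TTLs, which rules out COPY): whatever tool does the copy would need to read each row's remaining TTL and write timestamp and replay them on insert. A CQL sketch of the per-row mechanics, with hypothetical table and column names, and with example values standing in for what the read returns:

```sql
-- Read the payload together with its remaining TTL and original write time.
-- TTL()/WRITETIME() apply to regular (non primary key) columns only.
SELECT id, created_at, TTL(payload), WRITETIME(payload)
FROM my_keyspace.table1;

-- Re-insert into the target table, carrying both values over.
-- The TTL (seconds) and TIMESTAMP (microseconds) below would come
-- from the SELECT above, one pair per row.
INSERT INTO my_keyspace.table2 (day, created_at, id, payload)
VALUES ('2019-03-12', '2019-03-12 09:00:00',
        6e23f79a-8b67-47e0-b8e0-50be78bb1c7f, 'example')
USING TTL 86400 AND TIMESTAMP 1552383600000000;
```

This is what the Spark-connector approach mentioned above has to do under the hood; note that TTLs are tracked per column, so rows whose columns were written at different times may need more than one insert.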