Re: Optimal backup strategy
Same topology means the restore node should get the same tokens as the corresponding backup node. For example: backup node1 (tokens 1/2/3/4/5) and node2 (tokens 6/7/8/9/10); restore nodeA (1/2/3/4/5) and nodeB (6/7/8/9/10), so that node1's commitlog can be replayed on nodeA.

On Fri, Nov 29, 2019 at 2:03 PM, Adarsh Kumar wrote:
> Thanks Ahu and Hussein,
>
> So my understanding is:
>
> 1. Commit log backup is not documented for Apache Cassandra, hence not standard. But it can be used for a restore on the same machine (taking the backup from commit_log_dir). If used on other machine(s), they have to be in the same topology. Can it be used for a replacement node?
> 2. For periodic backup, Snapshot + Incremental backup is the best option.
>
> Thanks,
> Adarsh Kumar
>
> On Fri, Nov 29, 2019 at 7:28 AM guo Maxwell wrote:
>
>> Hossein is right, but in our case we restore to the same Cassandra topology, so commitlog replay is usable. Restoring to the same machine also works.
>> Using sstableloader costs too much time and extra storage (though the storage shrinks again after the restore).
>>
>> On Thu, Nov 28, 2019 at 7:40 PM, Hossein Ghiyasi Mehr wrote:
>>
>>> commitlog backup isn't usable on another machine.
>>> The backup solution depends on what you want to do: periodic backup, or backup to restore on another machine?
>>> A periodic backup is a combination of snapshots and incremental backups; remove the incremental backups after each new snapshot.
>>> To take a backup for restoring on another machine, you can use a snapshot after flushing the memtables, or use sstableloader.
>>>
>>> VafaTech.com - A Total Solution for Data Gathering & Analysis
>>>
>>> On Thu, Nov 28, 2019 at 6:05 AM, guo Maxwell wrote:

In the Cassandra and DataStax documentation, commitlog backup is not mentioned; only snapshots and incremental backups are described.
Although commitlog archiving per keyspace/table is not supported, commitlog replay (you must put the logs into commitlog_dir and restart the process) does support a keyspace/table replay filter: pass -Dcassandra.replayList in the format keyspace1.table1,keyspace1.table2 to replay only the specified keyspaces/tables.

Snapshots do affect storage. We take a snapshot once a week, during the low business peak, and snapshot creation is throttled; you may want to look at the issue https://issues.apache.org/jira/browse/CASSANDRA-13019.

On Thu, Nov 28, 2019 at 1:00 AM, Adarsh Kumar wrote:
> Thanks Guo and Eric for replying,
>
> I have some confusions about commit log backup:
>
> 1. The commit log archival technique (https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore-) is as good as an incremental backup, as it also captures commit logs after memtable flush.
> 2. If we go for "Snapshot + Incremental backup + Commit log", we have to take the commit logs from the commit log directory (is there any SOP for this?). As commit logs are not per table or keyspace, we will have a challenge restoring selective tables.
> 3. Snapshot-based backups are easy to manage and operate due to their simplicity, but they are heavy on storage. Any views on this?
> 4. Please share any successful strategy that someone is using in production. We are still in the design phase and want to implement the best solution.
>
> Thanks Eric for sharing the link to Medusa.
>
> Regards,
> Adarsh Kumar
>
> On Wed, Nov 27, 2019 at 5:16 PM, guo Maxwell wrote:
>
>> For me, the last one, Snapshot + Incremental + commitlog, is the most meaningful way to do backup and restore when you copy the backup somewhere else, such as AWS S3.
>>
>> - Snapshot-based backup // incremental data is not backed up, so you may lose data when restoring to a point in time later than the snapshot;
>> - Incremental backups // better than snapshot-only backup, but with insufficient accuracy: data still in the memtable will be lost;
>> - Snapshot + incremental
>> - Snapshot + commitlog archival // better data precision than incremental backup, but data in non-archived commitlogs (not yet archived and not yet closed) cannot be restored and will be lost. Also, when there are many logs, replay takes a long time.
>>
>> We use snapshot + incremental + commitlog archive. We read the snapshot data and the incremental data, and the logs are backed up as well. But we do not back up logs whose data has already been flushed to sstables, because that data is covered by the incremental backup.
>>
>> This way, the data exists in sstable form through snapshot and incremental backups, the number of logs stays small, and log replay does not take much time.
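[Editor's sketch] The archiving and filtered replay discussed above are configured through commitlog_archiving.properties plus a JVM flag at startup. A minimal sketch; the archive directory and the point-in-time value are illustrative, not taken from this thread:

```properties
# conf/commitlog_archiving.properties
# %path is the fully qualified path of the segment to archive; %name is its file name.
archive_command=/bin/cp %path /backup/commitlog_archive/%name
# %from is the archived segment, %to is the target location inside commitlog_dir.
restore_command=/bin/cp -f %from %to
restore_directories=/backup/commitlog_archive
# Optional cutoff so replay stops at a point in time:
# restore_point_in_time=2019:11:28 04:00:00
```

To replay only specific tables on restart, add the filter to the JVM options, e.g. `-Dcassandra.replayList=keyspace1.table1,keyspace1.table2`.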
Re: Optimal backup strategy
Thanks Ahu and Hussein,

So my understanding is:

1. Commit log backup is not documented for Apache Cassandra, hence not standard. But it can be used for a restore on the same machine (taking the backup from commit_log_dir). If used on other machine(s), they have to be in the same topology. Can it be used for a replacement node?
2. For periodic backup, Snapshot + Incremental backup is the best option.

Thanks,
Adarsh Kumar

On Fri, Nov 29, 2019 at 7:28 AM guo Maxwell wrote:
> Hossein is right, but in our case we restore to the same Cassandra topology, so commitlog replay is usable. Restoring to the same machine also works.
> Using sstableloader costs too much time and extra storage (though the storage shrinks again after the restore).
>
> On Thu, Nov 28, 2019 at 7:40 PM, Hossein Ghiyasi Mehr wrote:
>
>> commitlog backup isn't usable on another machine.
>> The backup solution depends on what you want to do: periodic backup, or backup to restore on another machine?
>> A periodic backup is a combination of snapshots and incremental backups; remove the incremental backups after each new snapshot.
>> To take a backup for restoring on another machine, you can use a snapshot after flushing the memtables, or use sstableloader.
>>
>> VafaTech.com - A Total Solution for Data Gathering & Analysis
>>
>> On Thu, Nov 28, 2019 at 6:05 AM, guo Maxwell wrote:
>>
>>> In the Cassandra and DataStax documentation, commitlog backup is not mentioned; only snapshots and incremental backups are described.
>>>
>>> Although commitlog archiving per keyspace/table is not supported, commitlog replay (you must put the logs into commitlog_dir and restart the process) does support a keyspace/table replay filter: pass -Dcassandra.replayList in the format keyspace1.table1,keyspace1.table2 to replay only the specified keyspaces/tables.
>>>
>>> Snapshots do affect storage. We take a snapshot once a week, during the low business peak, and snapshot creation is throttled; you may want to look at the issue https://issues.apache.org/jira/browse/CASSANDRA-13019.
>>>
>>> On Thu, Nov 28, 2019 at 1:00 AM, Adarsh Kumar wrote:
>>>
>>>> Thanks Guo and Eric for replying,
>>>>
>>>> I have some confusions about commit log backup:
>>>>
>>>> 1. The commit log archival technique (https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore-) is as good as an incremental backup, as it also captures commit logs after memtable flush.
>>>> 2. If we go for "Snapshot + Incremental backup + Commit log", we have to take the commit logs from the commit log directory (is there any SOP for this?). As commit logs are not per table or keyspace, we will have a challenge restoring selective tables.
>>>> 3. Snapshot-based backups are easy to manage and operate due to their simplicity, but they are heavy on storage. Any views on this?
>>>> 4. Please share any successful strategy that someone is using in production. We are still in the design phase and want to implement the best solution.
>>>>
>>>> Thanks Eric for sharing the link to Medusa.
>>>>
>>>> Regards,
>>>> Adarsh Kumar
>>>>
>>>> On Wed, Nov 27, 2019 at 5:16 PM, guo Maxwell wrote:
>>>>
>>>>> For me, the last one, Snapshot + Incremental + commitlog, is the most meaningful way to do backup and restore when you copy the backup somewhere else, such as AWS S3.
>>>>>
>>>>> - Snapshot-based backup // incremental data is not backed up, so you may lose data when restoring to a point in time later than the snapshot;
>>>>> - Incremental backups // better than snapshot-only backup, but with insufficient accuracy: data still in the memtable will be lost;
>>>>> - Snapshot + incremental
>>>>> - Snapshot + commitlog archival // better data precision than incremental backup, but data in non-archived commitlogs (not yet archived and not yet closed) cannot be restored and will be lost. Also, when there are many logs, replay takes a long time.
>>>>>
>>>>> We use snapshot + incremental + commitlog archive. We read the snapshot data and the incremental data, and the logs are backed up as well. But we do not back up logs whose data has already been flushed to sstables, because that data is covered by the incremental backup.
>>>>>
>>>>> This way, the data exists in sstable form through snapshot and incremental backups, the number of logs stays small, and log replay does not take much time.
>>>>>
>>>>> On Wed, Nov 27, 2019 at 4:13 PM, Eric LELEU wrote:
>>>>>
>>>>>> Hi,
>>>>>> TheLastPickle & Spotify have released Medusa as a Cassandra backup tool.
>>>>>>
>>>>>> See: https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
Re: Uneven token distribution with allocate_tokens_for_keyspace
Hi Enrico,

This is a classic chicken-and-egg problem with the allocate_tokens_for_keyspace setting. The setting uses the replication factor a keyspace has in the DC to calculate the token allocation when a node is added to the cluster for the first time. Nodes need to be added to the new DC before we can replicate the keyspace over to it. Herein lies the problem: we are unable to use allocate_tokens_for_keyspace unless the keyspace is replicated to the new DC. In addition, as soon as you change the keyspace replication to include the new DC, new data will start to be written to it. To work around this issue you will need to do the following.

1. Decommission all the nodes in *dcNew*, one at a time.
2. Once all the *dcNew* nodes are decommissioned, wipe the contents of the *commitlog*, *data*, *saved_caches*, and *hints* directories on these nodes.
3. Make the first node to add to *dcNew* a seed node. Set the seed list of the first node with its IP address and the IP addresses of the other seed nodes in the cluster.
4. Set the *initial_token* setting for the first node. You can calculate the values using the algorithm in my blog post: https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html. For convenience I have calculated them: *-9223372036854775808,-4611686018427387904,0,4611686018427387904*. Note: remove the *allocate_tokens_for_keyspace* setting from the *cassandra.yaml* file for this (seed) node.
5. Check that no other node in the cluster is assigned any of the four tokens specified above. If another node in the cluster is assigned one of these tokens, increment the conflicting token by one until no other node in the cluster is assigned that value. The idea is to make sure that these four tokens are unique to the node.
6. Add the seed node to the cluster. Make sure it is listed in *dcNew* by checking nodetool status.
7. Create a dummy keyspace in *dcNew* that has a replication factor of 2.
8. Set the *allocate_tokens_for_keyspace* value to the name of the dummy keyspace for the other two nodes you want to add to *dcNew*. Note: remove the *initial_token* setting for these other nodes.
9. Set *auto_bootstrap* to *false* for the other two nodes you want to add to *dcNew*.
10. Add the other two nodes to the cluster, one at a time.
11. If you are happy with the distribution, copy the data to *dcNew* by running a rebuild.

Hope this helps.

Regards,
Anthony

On Fri, 29 Nov 2019 at 02:08, Enrico Cavallin wrote:
> Hi all,
> I have an old datacenter with 4 nodes and 256 tokens each. I am now starting a new datacenter with 3 nodes, with num_tokens=4 and allocate_tokens_for_keyspace=myBiggestKeyspace on each node. Both DCs run Cassandra 3.11.x.
>
> myBiggestKeyspace has RF=3 in dcOld and RF=2 in dcNew. Now dcNew is very unbalanced. Keyspaces with RF=2 in both DCs have the same problem.
> Did I miss something, or do I face strong limitations with low num_tokens even with allocate_tokens_for_keyspace?
> Any suggestions on how to mitigate it?
>
> # nodetool status myBiggestKeyspace
> Datacenter: dcOld
> =================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  x.x.x.x  515.83 GiB  256     76.2%             fc462eb2-752f-4d26-aae3-84cb9c977b8a  rack1
> UN  x.x.x.x  504.09 GiB  256     72.7%             d7af8685-ba95-4854-a220-bc52dc242e9c  rack1
> UN  x.x.x.x  507.50 GiB  256     74.6%             b3a4d3d1-e87d-468b-a7d9-3c104e219536  rack1
> UN  x.x.x.x  490.81 GiB  256     76.5%             41e80c5b-e4e3-46f6-a16f-c784c0132dbc  rack1
>
> Datacenter: dcNew
> =================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  x.x.x.x  145.47 KiB  4       56.3%             7d089351-077f-4c36-a2f5-007682f9c215  rack1
> UN  x.x.x.x  122.51 KiB  4       55.5%             625dafcb-0822-4c8b-8551-5350c528907a  rack1
> UN  x.x.x.x  127.53 KiB  4       88.2%             c64c0ce4-2f85-4323-b0ba-71d70b8e6fbf  rack1
>
> Thanks,
> -- ec
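[Editor's sketch] Step 4 above splits the Murmur3 token range evenly across the node's vnodes. The initial_token values Anthony lists can be reproduced with a short script; this is a sketch of the formula described in the linked blog post, not code from it:

```python
# Evenly spaced initial tokens for one node with num_tokens vnodes.
# Murmur3Partitioner's token range is [-2**63, 2**63 - 1], so we step
# by 2**64 // num_tokens starting from the minimum token.
def even_tokens(num_tokens: int) -> list:
    step = 2**64 // num_tokens
    return [i * step - 2**63 for i in range(num_tokens)]

print(",".join(str(t) for t in even_tokens(4)))
# -9223372036854775808,-4611686018427387904,0,4611686018427387904
```

The comma-joined output can be pasted directly into the initial_token setting in cassandra.yaml.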
Re: Optimal backup strategy
Hossein is right, but in our case we restore to the same Cassandra topology, so commitlog replay is usable. Restoring to the same machine also works.
Using sstableloader costs too much time and extra storage (though the storage shrinks again after the restore).

On Thu, Nov 28, 2019 at 7:40 PM, Hossein Ghiyasi Mehr wrote:
> commitlog backup isn't usable on another machine.
> The backup solution depends on what you want to do: periodic backup, or backup to restore on another machine?
> A periodic backup is a combination of snapshots and incremental backups; remove the incremental backups after each new snapshot.
> To take a backup for restoring on another machine, you can use a snapshot after flushing the memtables, or use sstableloader.
>
> VafaTech.com - A Total Solution for Data Gathering & Analysis
>
> On Thu, Nov 28, 2019 at 6:05 AM, guo Maxwell wrote:
>
>> In the Cassandra and DataStax documentation, commitlog backup is not mentioned; only snapshots and incremental backups are described.
>>
>> Although commitlog archiving per keyspace/table is not supported, commitlog replay (you must put the logs into commitlog_dir and restart the process) does support a keyspace/table replay filter: pass -Dcassandra.replayList in the format keyspace1.table1,keyspace1.table2 to replay only the specified keyspaces/tables.
>>
>> Snapshots do affect storage. We take a snapshot once a week, during the low business peak, and snapshot creation is throttled; you may want to look at the issue https://issues.apache.org/jira/browse/CASSANDRA-13019.
>>
>> On Thu, Nov 28, 2019 at 1:00 AM, Adarsh Kumar wrote:
>>
>>> Thanks Guo and Eric for replying,
>>>
>>> I have some confusions about commit log backup:
>>>
>>> 1. The commit log archival technique (https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore-) is as good as an incremental backup, as it also captures commit logs after memtable flush.
>>> 2. If we go for "Snapshot + Incremental backup + Commit log", we have to take the commit logs from the commit log directory (is there any SOP for this?). As commit logs are not per table or keyspace, we will have a challenge restoring selective tables.
>>> 3. Snapshot-based backups are easy to manage and operate due to their simplicity, but they are heavy on storage. Any views on this?
>>> 4. Please share any successful strategy that someone is using in production. We are still in the design phase and want to implement the best solution.
>>>
>>> Thanks Eric for sharing the link to Medusa.
>>>
>>> Regards,
>>> Adarsh Kumar
>>>
>>> On Wed, Nov 27, 2019 at 5:16 PM, guo Maxwell wrote:
>>>
>>>> For me, the last one, Snapshot + Incremental + commitlog, is the most meaningful way to do backup and restore when you copy the backup somewhere else, such as AWS S3.
>>>>
>>>> - Snapshot-based backup // incremental data is not backed up, so you may lose data when restoring to a point in time later than the snapshot;
>>>> - Incremental backups // better than snapshot-only backup, but with insufficient accuracy: data still in the memtable will be lost;
>>>> - Snapshot + incremental
>>>> - Snapshot + commitlog archival // better data precision than incremental backup, but data in non-archived commitlogs (not yet archived and not yet closed) cannot be restored and will be lost. Also, when there are many logs, replay takes a long time.
>>>>
>>>> We use snapshot + incremental + commitlog archive. We read the snapshot data and the incremental data, and the logs are backed up as well. But we do not back up logs whose data has already been flushed to sstables, because that data is covered by the incremental backup.
>>>>
>>>> This way, the data exists in sstable form through snapshot and incremental backups, the number of logs stays small, and log replay does not take much time.
>>>>
>>>> On Wed, Nov 27, 2019 at 4:13 PM, Eric LELEU wrote:
>>>>
>>>>> Hi,
>>>>> TheLastPickle & Spotify have released Medusa as a Cassandra backup tool.
>>>>>
>>>>> See: https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
>>>>>
>>>>> Hope this link will help you.
>>>>>
>>>>> Eric
>>>>>
>>>>> On 27/11/2019 at 08:10, Adarsh Kumar wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was looking into backup strategies for Cassandra. After some study I came to know that there are the following options:
>>>>>>
>>>>>> - Snapshot-based backup
>>>>>> - Incremental backups
>>>>>> - Snapshot + incremental
>>>>>> - Snapshot + commitlog archival
>>>>>> - Snapshot + Incremental + commitlog
>>>>>>
>>>>>> Which is the most suitable and feasible approach? Also, which of these is used most?
>>>>>> Please let me know if there is any other option or tool available.
Uneven token distribution with allocate_tokens_for_keyspace
Hi all,
I have an old datacenter with 4 nodes and 256 tokens each. I am now starting a new datacenter with 3 nodes, with num_tokens=4 and allocate_tokens_for_keyspace=myBiggestKeyspace on each node. Both DCs run Cassandra 3.11.x.

myBiggestKeyspace has RF=3 in dcOld and RF=2 in dcNew. Now dcNew is very unbalanced. Keyspaces with RF=2 in both DCs have the same problem.
Did I miss something, or do I face strong limitations with low num_tokens even with allocate_tokens_for_keyspace?
Any suggestions on how to mitigate it?

# nodetool status myBiggestKeyspace
Datacenter: dcOld
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load        Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.x  515.83 GiB  256     76.2%             fc462eb2-752f-4d26-aae3-84cb9c977b8a  rack1
UN  x.x.x.x  504.09 GiB  256     72.7%             d7af8685-ba95-4854-a220-bc52dc242e9c  rack1
UN  x.x.x.x  507.50 GiB  256     74.6%             b3a4d3d1-e87d-468b-a7d9-3c104e219536  rack1
UN  x.x.x.x  490.81 GiB  256     76.5%             41e80c5b-e4e3-46f6-a16f-c784c0132dbc  rack1

Datacenter: dcNew
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load        Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.x  145.47 KiB  4       56.3%             7d089351-077f-4c36-a2f5-007682f9c215  rack1
UN  x.x.x.x  122.51 KiB  4       55.5%             625dafcb-0822-4c8b-8551-5350c528907a  rack1
UN  x.x.x.x  127.53 KiB  4       88.2%             c64c0ce4-2f85-4323-b0ba-71d70b8e6fbf  rack1

Thanks,
-- ec
Re: Optimal backup strategy
commitlog backup isn't usable on another machine.
The backup solution depends on what you want to do: periodic backup, or backup to restore on another machine?
A periodic backup is a combination of snapshots and incremental backups; remove the incremental backups after each new snapshot.
To take a backup for restoring on another machine, you can use a snapshot after flushing the memtables, or use sstableloader.

VafaTech.com - A Total Solution for Data Gathering & Analysis

On Thu, Nov 28, 2019 at 6:05 AM guo Maxwell wrote:
> In the Cassandra and DataStax documentation, commitlog backup is not mentioned; only snapshots and incremental backups are described.
>
> Although commitlog archiving per keyspace/table is not supported, commitlog replay (you must put the logs into commitlog_dir and restart the process) does support a keyspace/table replay filter: pass -Dcassandra.replayList in the format keyspace1.table1,keyspace1.table2 to replay only the specified keyspaces/tables.
>
> Snapshots do affect storage. We take a snapshot once a week, during the low business peak, and snapshot creation is throttled; you may want to look at the issue https://issues.apache.org/jira/browse/CASSANDRA-13019.
>
> On Thu, Nov 28, 2019 at 1:00 AM, Adarsh Kumar wrote:
>
>> Thanks Guo and Eric for replying,
>>
>> I have some confusions about commit log backup:
>>
>> 1. The commit log archival technique (https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore-) is as good as an incremental backup, as it also captures commit logs after memtable flush.
>> 2. If we go for "Snapshot + Incremental backup + Commit log", we have to take the commit logs from the commit log directory (is there any SOP for this?). As commit logs are not per table or keyspace, we will have a challenge restoring selective tables.
>> 3. Snapshot-based backups are easy to manage and operate due to their simplicity, but they are heavy on storage. Any views on this?
>> 4. Please share any successful strategy that someone is using in production. We are still in the design phase and want to implement the best solution.
>>
>> Thanks Eric for sharing the link to Medusa.
>>
>> Regards,
>> Adarsh Kumar
>>
>> On Wed, Nov 27, 2019 at 5:16 PM guo Maxwell wrote:
>>
>>> For me, the last one, Snapshot + Incremental + commitlog, is the most meaningful way to do backup and restore when you copy the backup somewhere else, such as AWS S3.
>>>
>>> - Snapshot-based backup // incremental data is not backed up, so you may lose data when restoring to a point in time later than the snapshot;
>>> - Incremental backups // better than snapshot-only backup, but with insufficient accuracy: data still in the memtable will be lost;
>>> - Snapshot + incremental
>>> - Snapshot + commitlog archival // better data precision than incremental backup, but data in non-archived commitlogs (not yet archived and not yet closed) cannot be restored and will be lost. Also, when there are many logs, replay takes a long time.
>>>
>>> We use snapshot + incremental + commitlog archive. We read the snapshot data and the incremental data, and the logs are backed up as well. But we do not back up logs whose data has already been flushed to sstables, because that data is covered by the incremental backup.
>>>
>>> This way, the data exists in sstable form through snapshot and incremental backups, the number of logs stays small, and log replay does not take much time.
>>>
>>> On Wed, Nov 27, 2019 at 4:13 PM, Eric LELEU wrote:
>>>
>>>> Hi,
>>>> TheLastPickle & Spotify have released Medusa as a Cassandra backup tool.
>>>>
>>>> See: https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
>>>>
>>>> Hope this link will help you.
>>>>
>>>> Eric
>>>>
>>>> On 27/11/2019 at 08:10, Adarsh Kumar wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was looking into backup strategies for Cassandra. After some study I came to know that there are the following options:
>>>>>
>>>>> - Snapshot-based backup
>>>>> - Incremental backups
>>>>> - Snapshot + incremental
>>>>> - Snapshot + commitlog archival
>>>>> - Snapshot + Incremental + commitlog
>>>>>
>>>>> Which is the most suitable and feasible approach? Also, which of these is used most?
>>>>> Please let me know if there is any other option or tool available.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Regards,
>>>>> Adarsh Kumar
>>>
>>> --
>>> you are the apple of my eye !
>
> --
> you are the apple of my eye !
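[Editor's sketch] Hossein's "remove incremental backup after new snapshot" housekeeping can be sketched as a small script. The directory layout here is a mock built inside a temp dir so the steps are concrete; on a real node DATA_DIR would be one of the node's data_file_directories (commonly /var/lib/cassandra/data), incremental_backups: true would be set in cassandra.yaml, and step 1 would be a real nodetool call:

```shell
# Mock a Cassandra data directory with one table's incremental backups.
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/my_keyspace/my_table-abc123/backups"
touch "$DATA_DIR/my_keyspace/my_table-abc123/backups/md-1-big-Data.db"

# 1. Take a fresh snapshot (on a live node; shown for context):
#    nodetool snapshot -t "weekly_$(date +%Y%m%d)" my_keyspace

# 2. Once the snapshot is safely copied off-node, clear the per-table
#    backups/ directories so they only accumulate deltas created after
#    this snapshot:
find "$DATA_DIR" -type d -name backups -exec sh -c 'rm -f "$1"/*' _ {} \;

ls "$DATA_DIR/my_keyspace/my_table-abc123/backups" | wc -l   # prints 0
```

The key point, as in the thread: incrementals older than the latest off-node snapshot are redundant and only cost storage.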