efficiently generate complete database dump in text format
Hi,

We have a Cassandra column family containing 320 million rows, and each row contains about 15 columns. We want to take a monthly dump of this single column family in text format.

We are planning the following approach:

1. Take a snapshot of the Cassandra database using the nodetool utility. We pass the -cf flag with the column family name so that the snapshot contains only the data for that column family.
2. Back up this snapshot and move the backup to a separate physical machine.
3. Use the SSTable-to-JSON conversion utility to convert all the data files into JSON format.

We have the following questions about this approach:

a) The generated JSON records contain a d (IS_MARKED_FOR_DELETE) flag. Can I safely ignore all such records?
b) If I ignore all records marked with the d flag, can the JSON files generated in step 3 still contain duplicate records, that is, multiple entries for the same key?

Is there a better approach to generating data dumps in text format?

Regards,
Gaurav
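Steps 1 and 3 above could be scripted roughly as follows. This is a minimal sketch, assuming the `nodetool` and `sstable2json` tools shipped with Cassandra 2.x are on the PATH. The function names, keyspace, column family, snapshot tag, and directory are hypothetical placeholders, and the functions only build the command lines rather than run them.

```python
from pathlib import Path


def snapshot_command(keyspace, column_family, tag):
    """Build the nodetool call for step 1 (snapshot of one column family)."""
    # -cf limits the snapshot to one column family; -t names the snapshot.
    return ["nodetool", "snapshot", "-cf", column_family, "-t", tag, keyspace]


def sstable2json_commands(snapshot_dir):
    """Build one sstable2json call per data file for step 3."""
    # Only the *-Data.db component is passed on the command line; the
    # other SSTable components are expected to sit next to it.
    data_files = sorted(Path(snapshot_dir).glob("*-Data.db"))
    return [["sstable2json", str(path)] for path in data_files]
```

Each returned list can be handed to `subprocess.run`, with the standard output of `sstable2json` redirected to one JSON file per SSTable.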
Re: efficiently generate complete database dump in text format
The best way to generate dumps from Cassandra is via Hadoop integration (or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar gbhatna...@gmail.com wrote:
> Is there a better approach to generating data dumps in text format?

--
Paulo Motta
Chaordic | Platform
http://www.chaordic.com.br/
+55 48 3232.3200
Re: efficiently generate complete database dump in text format
You might also want to consider tools like https://github.com/Netflix/aegisthus for the last step, which can help you deal with tombstones and de-duplicate data.

Thanks,
Daniel
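To illustrate what dealing with tombstones and de-duplicating involves, here is a minimal sketch that reconciles several JSON dumps by timestamp. It is not how aegisthus works internally. It assumes the Cassandra 2.x `sstable2json` layout, in which each file is a list of rows of the form `{"key": ..., "columns": [[name, value, timestamp, flag?], ...]}` and a fourth element `"d"` marks a deleted column. Row-level deletions, range tombstones, expiring columns, and counters are deliberately not handled, and the function name is hypothetical.

```python
import json


def merge_sstable_dumps(json_texts):
    """Reconcile columns from several sstable2json dumps.

    Returns {row_key: {column_name: value}} containing only live columns.
    """
    latest = {}  # (row_key, column_name) -> (timestamp, is_tombstone, value)
    for text in json_texts:
        for row in json.loads(text):
            for column in row.get("columns", []):
                name, value, timestamp = column[0], column[1], column[2]
                # Assumed layout: a fourth element "d" marks a deleted column.
                is_tombstone = len(column) > 3 and column[3] == "d"
                candidate = (timestamp, is_tombstone, value)
                current = latest.get((row["key"], name))
                # Highest timestamp wins; on a tie the tombstone wins.
                if current is None or candidate[:2] > current[:2]:
                    latest[(row["key"], name)] = candidate

    rows = {}
    for (key, name), (_, is_tombstone, value) in latest.items():
        if not is_tombstone:
            rows.setdefault(key, {})[name] = value
    return rows
```

Under these assumptions the same key can appear in more than one SSTable, and a tombstone in one file can shadow an older live value in another, so records flagged d have to take part in the merge before they are dropped. With 320 million rows the merge would need to be done per key range or as an external sort rather than in one in-memory dictionary.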