Hi,

We have a Cassandra column family containing 320 million rows, and each row contains about 15 columns. We want to take a monthly dump of this single column family in text format.
We are planning to take the following approach:

1. Take a snapshot of the Cassandra database using the nodetool utility, passing the -cf flag with the column family name so that the snapshot contains only the data for that single column family.
2. Back up this snapshot and move the backup to a separate physical machine.
3. Use the "SSTable to JSON conversion" utility to convert all the data files into JSON format.

We have the following questions about this approach:

a) The generated JSON records contain a "d" (IS_MARKED_FOR_DELETE) flag. Can I safely ignore all such records?
b) If I ignore all records marked with the "d" flag, can the JSON files generated in step 3 still contain duplicate records, i.e. multiple entries for the same key?

Is there a better approach for generating data dumps in text format?

Regards,
Gaurav
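
P.S. To make question (a) concrete, here is a minimal Python sketch of the filtering we have in mind. It assumes each exported file is a JSON array of rows shaped like {"key": ..., "columns": [[name, value, timestamp, flag], ...]}, with "d" as an optional fourth element of a column. The exact layout may differ between Cassandra versions. The function name drop_deleted_columns and the sample data are made up for illustration. It only looks at one file at a time, so it says nothing about the same key appearing in several files.

```python
import json


def drop_deleted_columns(rows):
    """Return a copy of rows with every column flagged "d" removed.

    Assumed (not verified) layout: each row is a dict with "key" and
    "columns"; each column is a list [name, value, timestamp] with an
    optional fourth element holding a flag such as "d".
    """
    cleaned = []
    for row in rows:
        live = [
            col for col in row.get("columns", [])
            if not (len(col) > 3 and col[3] == "d")
        ]
        cleaned.append({"key": row["key"], "columns": live})
    return cleaned


# Hypothetical sample: one row with a live column and a deleted one.
sample = json.loads(
    '[{"key": "6b31", "columns": ['
    '["name", "alice", 1370000000000000], '
    '["email", "51a8c2f0", 1370000000000001, "d"]]}]'
)
result = drop_deleted_columns(sample)
```

With this sample, result keeps the row for key "6b31" but only its "name" column.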