efficiently generate complete database dump in text format

2014-10-09 Thread Gaurav Bhatnagar
Hi,
   We have a Cassandra column family containing 320 million rows,
and each row contains about 15 columns. We want to take a monthly dump of
this single column family in text format.

We are planning to take the following approach:
1. Take a snapshot of the Cassandra database using the nodetool utility. We
   specify the -cf flag with the column family name so that the snapshot
   contains data for that single column family only.
2. Back up this snapshot and move the backup to a separate physical machine.
3. Use the SSTable-to-JSON conversion utility to convert all the data files
   into JSON format.
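
The three steps above could be scripted roughly as follows. This is only a sketch under assumptions: the keyspace, column family, tag, and directory names are hypothetical placeholders, the helper function names are invented for illustration, and it assumes the nodetool and sstable2json tools are installed and configured on the machine where each command runs.

```python
import glob
import os
import subprocess


def snapshot_command(keyspace, column_family, tag):
    # Step 1: nodetool snapshot limited to one column family (-cf),
    # with a named tag (-t) so the snapshot directory is easy to find.
    return ["nodetool", "snapshot", "-cf", column_family, "-t", tag, keyspace]


def sstable2json_command(data_file):
    # Step 3: sstable2json writes the JSON for one SSTable data file to stdout.
    return ["sstable2json", data_file]


def convert_snapshot(snapshot_dir, out_dir):
    # Convert every *-Data.db file in the copied snapshot directory
    # (step 2 is assumed to have copied the snapshot to snapshot_dir).
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for data_file in sorted(glob.glob(os.path.join(snapshot_dir, "*-Data.db"))):
        out_path = os.path.join(out_dir, os.path.basename(data_file) + ".json")
        with open(out_path, "w") as out:
            subprocess.run(sstable2json_command(data_file), stdout=out, check=True)
        written.append(out_path)
    return written
```

For example, `snapshot_command("my_keyspace", "my_cf", "monthly")` builds the argument list for the snapshot, and `convert_snapshot("/backup/monthly", "/backup/json")` would produce one JSON file per SSTable data file.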

We have the following questions regarding this approach:
a) The generated JSON records contain a d (IS_MARKED_FOR_DELETE) flag. Can I
   safely ignore all such records?
b) If I ignore all records marked with the d flag, can the JSON files
   generated in step 3 still contain duplicate records, i.e. multiple
   entries for the same key?

Is there a better approach to generating data dumps in text format?

Regards,
Gaurav


Re: efficiently generate complete database dump in text format

2014-10-09 Thread Paulo Ricardo Motta Gomes
The best way to generate dumps from Cassandra is via Hadoop integration (or
Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br*
+55 48 3232.3200


Re: efficiently generate complete database dump in text format

2014-10-09 Thread Daniel Chia
You might also want to consider tools like
https://github.com/Netflix/aegisthus for the last step, which can help you
deal with tombstones and de-duplicate data.
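
To illustrate why tombstones and duplicates need handling, here is a minimal sketch of the merge logic. It is not aegisthus and not how aegisthus is implemented. It assumes each JSON file parses to a list of rows shaped like {"key": ..., "columns": [[name, value, timestamp, optional flag]]}, with the flag "d" marking a deleted column; the function name and sample data are hypothetical. The same row key can appear in several SSTables (and on several replicas), and a tombstone may shadow an older value stored in a different file, so deleted records have to be taken into account during the merge rather than skipped while reading.

```python
def merge_rows(dumps):
    """Merge parsed JSON dumps, keeping the newest version of each column.

    dumps: iterable of lists of row dicts, one list per converted SSTable.
    Returns {row_key: {column_name: value}} with deleted columns removed.
    """
    latest = {}  # (row key, column name) -> (timestamp, is_deleted, value)
    for rows in dumps:
        for row in rows:
            for col in row.get("columns", []):
                name, value, ts = col[0], col[1], col[2]
                deleted = len(col) > 3 and col[3] == "d"
                slot = (row["key"], name)
                # Newest timestamp wins; on a tie the tombstone wins.
                if slot not in latest or (ts, deleted) > latest[slot][:2]:
                    latest[slot] = (ts, deleted, value)

    merged = {}
    for (key, name), (ts, deleted, value) in latest.items():
        if not deleted:  # drop tombstones only after they have shadowed old data
            merged.setdefault(key, {})[name] = value
    return merged


# Hypothetical sample: row k1 appears in two files, and column b was deleted later.
file1 = [{"key": "k1", "columns": [["a", "1", 100], ["b", "2", 100]]},
         {"key": "k2", "columns": [["a", "x", 100]]}]
file2 = [{"key": "k1", "columns": [["a", "9", 200], ["b", "", 150, "d"]]}]
result = merge_rows([file1, file2])
```

Here `result` is `{"k1": {"a": "9"}, "k2": {"a": "x"}}`: the older value of column a is replaced, and column b is dropped because its tombstone is newer. Row-level deletions and expiring columns are not handled in this sketch, and an in-memory dictionary would not hold hundreds of millions of rows, which is where a distributed job becomes useful.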

Thanks,
Daniel
