efficiently generate complete database dump in text format

2014-10-09 Thread Gaurav Bhatnagar
Hi,
   We have a Cassandra database column family containing 320 million rows,
and each row contains about 15 columns. We want to take a monthly dump of
this single column family in text format.

We are planning to take the following approach to implement this functionality:
1. Take a snapshot of the Cassandra database using the nodetool utility,
   passing the -cf flag with the column family name so that the snapshot
   contains data for only that single column family.
2. Take a backup of this snapshot and move the backup to a separate
   physical machine.
3. Use the SSTable-to-JSON conversion utility to convert all the data
   files into JSON format.
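For what it's worth, the three steps can be sketched as a small shell script. The keyspace, table, and paths below are hypothetical placeholders, and the commands are printed as a dry run rather than executed (drop the leading echo to run them for real):

```shell
#!/bin/sh
# Dry-run sketch of the snapshot -> copy -> JSON pipeline.
# Keyspace, table, and paths are hypothetical; substitute your own.
dump_plan() {
  ks=my_keyspace; cf=my_cf
  tag="monthly_$(date +%Y%m)"
  # 1. snapshot only the one column family
  echo nodetool snapshot -cf "$cf" -t "$tag" "$ks"
  # 2. move the snapshot files to a separate machine
  echo rsync -a "/var/lib/cassandra/data/$ks/$cf/snapshots/$tag/" \
       "backup-host:/data/dumps/$tag/"
  # 3. convert every SSTable data file to JSON
  echo "for f in /data/dumps/$tag/*-Data.db; do sstable2json \$f > \$f.json; done"
}
dump_plan
```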

We have the following questions/doubts regarding the above approach:
a) Some generated JSON records contain the d (IS_MARKED_FOR_DELETE) flag.
   Can I safely ignore all such records?
b) If I ignore all records marked with the d flag, can the JSON files
   generated in step 3 still contain duplicate records, i.e. multiple
   entries for the same key?

Is there any better approach to generating data dumps in text format?

Regards,
Gaurav


RE: Doubts with the values of the parameter broadcast_rpc_address

2014-10-09 Thread Ricard Mestre Subirats
Thank you Tyler! It's true that the cluster works!
However, I'm still going to ask about the nodetool error, because it interests us.

Thank you!

Ricard

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Wednesday, October 8, 2014 17:25
To: user@cassandra.apache.org
Subject: Re: Doubts with the values of the parameter broadcast_rpc_address


On Wed, Oct 8, 2014 at 5:20 AM, Ricard Mestre Subirats 
ricard.mestre.subir...@everis.com 
wrote:
At the machine with IP 192.168.150.112:
-cluster_name: 'CassandraCluster1'
-seeds: 192.168.150.112
-listen_address: 192.168.150.112
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
-cluster_name: 'CassandraCluster1'
-seeds: 192.168.150.112
-listen_address: 192.168.150.113
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.113

This is the correct configuration.


Then, if we start the service and execute “nodetool status” the result is the 
following:
nodetool: Failed to connect to '127.0.0.1:7199' - 
NoRouteToHostException: 'No route to host'.

Nodetool does not (generally) use rpc_address/broadcast_rpc_address, because 
it does not use the normal client API; it uses JMX.  This is a different problem.  
If you want to check rpc_address/broadcast_rpc_address, use cqlsh (and pass an 
address).

You can specify a hostname for nodetool with the -h option: nodetool -h 
192.168.150.112 status.  Depending on your setup, you may also need to edit the 
line in conf/cassandra-env.sh that sets this option:
-Djava.rmi.server.hostname=<public name>
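Putting both suggestions together as a sketch (the IP is the seed node from this thread; the commands are printed rather than executed, so drop the echo to run them):

```shell
#!/bin/sh
# Dry-run sketch of the two fixes above.
jmx_fix() {
  # point nodetool at the node explicitly instead of 127.0.0.1
  echo nodetool -h 192.168.150.112 status
  # and/or pin the RMI hostname in conf/cassandra-env.sh on that node:
  echo 'JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=192.168.150.112"'
}
jmx_fix
```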


--
Tyler Hobbs
DataStax http://datastax.com/



AVISO DE CONFIDENCIALIDAD.
Este correo y la información contenida o adjunta al mismo es privada y 
confidencial y va dirigida exclusivamente a su destinatario. everis informa a 
quien pueda haber recibido este correo por error que contiene información 
confidencial cuyo uso, copia, reproducción o distribución está expresamente 
prohibida. Si no es Vd. el destinatario del mismo y recibe este correo por 
error, le rogamos lo ponga en conocimiento del emisor y proceda a su 
eliminación sin copiarlo, imprimirlo o utilizarlo de ningún modo.

CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


Error connection in Nodetool

2014-10-09 Thread Ricard Mestre Subirats
Hi everyone,

I have an issue with the nodetool connection. I'm configuring a 2-node cluster with 
Cassandra 2.1 and Java 8, and I'm having problems with nodetool. The 
configuration of the machines is the following:

At the machine with IP 192.168.150.112:
-cluster_name: 'CassandraCluster1'
-seeds: 192.168.150.112
-listen_address: 192.168.150.112
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
-cluster_name: 'CassandraCluster1'
-seeds: 192.168.150.112
-listen_address: 192.168.150.113
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.113
When I execute nodetool status, the result is the following error:
nodetool: Failed to connect to '127.0.0.1:7199' - NoRouteToHostException: 
'No route to host'.

I have read answers to similar errors pointing out that you can solve this by 
configuring the following parameter in cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=cassandra2"

I have configured this parameter in both machines but the error persists.
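A quick way to narrow this kind of failure down is to check whether JMX is actually listening and reachable. These are standard networking tools, not Cassandra-specific; the IP is the one from this thread, and the commands are printed as a dry run (drop the echo to execute):

```shell
#!/bin/sh
# Dry-run sketch of basic JMX connectivity checks.
jmx_checks() {
  # Is anything listening on the JMX port on the node itself?
  echo "ss -ltn | grep 7199"
  # Can the port be reached from the machine running nodetool?
  echo "nc -vz 192.168.150.112 7199"
  # Point nodetool at the node explicitly instead of 127.0.0.1:
  echo "nodetool -h 192.168.150.112 status"
}
jmx_checks
```

If the port is listening but unreachable from the other machine, a firewall between the nodes is the usual suspect.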

Can someone help me?
Thank you!

Ricard





Re: Multi-DC Repairs and Token Questions

2014-10-09 Thread Alain RODRIGUEZ
Ok got it.

Thanks.

2014-10-07 14:56 GMT+02:00 Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com:

 This related issue might be of interest:
 https://issues.apache.org/jira/browse/CASSANDRA-7450

 In 1.2 the -pr option does perform cross-DC repairs, but you must ensure that
 all nodes in all datacenters execute the repair, otherwise some ranges will be
 missing. This fix enables -pr and -local together, which was disabled in
 2.0 because it didn't work (it also does not work in 1.2).
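The rule above (with -pr, every node in every datacenter must run the repair, or some ranges are never repaired) can be sketched with hypothetical hostnames, printed as a dry run:

```shell
#!/bin/sh
# Dry-run sketch: with -pr, *every* node in *every* DC must be repaired.
# Hostnames are hypothetical placeholders; drop the echo to execute.
repair_all() {
  for h in nsw1 nsw2 nsw3 vic1 vic2 vic3; do
    echo nodetool -h "$h" repair -pr
  done
}
repair_all
```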

 On Tue, Oct 7, 2014 at 5:46 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi guys, sorry about digging this up, but is this bug also affecting
 1.2.x versions? I can't see it being backported to 1.2 in Jira. Was
 this bug introduced in 2.0?

 Anyway, how does nodetool repair -pr behave in a multi-DC environment: does
 it make cross-DC repairs or not? Should we drop the -pr option in a multi-DC
 context to remove entropy between DCs? I mean, a repair -pr is supposed
 to repair the primary range of the current node; does it also repair the
 corresponding primary range in other DCs?

 Thanks for insight around this.

 2014-06-03 8:06 GMT+02:00 Nick Bailey n...@datastax.com:

 See https://issues.apache.org/jira/browse/CASSANDRA-7317


 On Mon, Jun 2, 2014 at 8:57 PM, Matthew Allen matthew.j.al...@gmail.com
  wrote:

 Hi Rameez, Chovatia, (sorry I initially replied to Dwight individually)

 SN_KEYSPACE and MY_KEYSPACE are just typos (I was trying to mask out
 identifiable information); they are the same keyspace.

 Keyspace: SN_KEYSPACE:
   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
   Options: [DC_VIC:2, DC_NSW:2]

 In a nutshell, replication is working as expected; I'm just confused
 about token range assignments in a multi-DC environment and how repairs
 should work.

 From
 http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html,
 it specifies

 *Multiple data center deployments: calculate the tokens for
 each data center so that the hash range is evenly divided for the nodes in
 each data center*

 Given that nodetool repair isn't multi-DC aware, in our production 18-node
 cluster (9 nodes in each DC), which of the following token ranges
 should be used (Murmur3Partitioner)?

 Token range divided evenly over the 2 DC's/18 nodes as below ?

 Node  DC_NSW                  DC_VIC
 1     '-9223372036854775808'  '-8198552921648689608'
 2     '-7173733806442603408'  '-6148914691236517208'
 3     '-5124095576030431008'  '-4099276460824344808'
 4     '-3074457345618258608'  '-2049638230412172408'
 5     '-1024819115206086208'  '-8'
 6     '1024819115206086192'   '2049638230412172392'
 7     '3074457345618258592'   '4099276460824344792'
 8     '5124095576030430992'   '6148914691236517192'
 9     '7173733806442603392'   '8198552921648689592'

 Or an offset used for DC_VIC (i.e. DC_NSW + 100)?

 Node DC_NSW DC_VIC
 1 '-9223372036854775808''-9223372036854775708'
 2 '-7173733806442603407''-7173733806442603307'
 3 '-5124095576030431006''-5124095576030430906'
 4 '-3074457345618258605''-3074457345618258505'
 5 '-1024819115206086204''-1024819115206086104'
 6 '1024819115206086197' '1024819115206086297'
 7 '3074457345618258598' '3074457345618258698'
 8 '5124095576030430999' '5124095576030431099'
 9 '7173733806442603400' '7173733806442603500'
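Both layouts come from the standard even-spacing formula for Murmur3, token_i = (2**64 // N) * i - 2**63; the second layout just adds a fixed offset of 100 for DC_VIC. A small sketch (9 nodes per DC, as in the thread; python3 is assumed for the big-integer arithmetic):

```shell
#!/bin/sh
# Compute evenly spaced Murmur3 tokens for N nodes, with the second DC
# offset by a constant (the "+100" scheme discussed above).
gen_tokens() {
  nodes=$1; offset=$2; i=0
  while [ "$i" -lt "$nodes" ]; do
    python3 -c "
step = 2**64 // $nodes
nsw = step * $i - 2**63
print(f'node {$i + 1}: DC_NSW {nsw}  DC_VIC {nsw + $offset}')"
    i=$((i + 1))
  done
}
gen_tokens 9 100
```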

 It's too late for me to switch to vnodes. Hope that makes sense, thanks.

 Matt



 On Thu, May 29, 2014 at 12:01 AM, Rameez Thonnakkal ssram...@gmail.com
  wrote:

 As Chovatia mentioned, the keyspaces seem to be different.
 Try DESCRIBE KEYSPACE SN_KEYSPACE and DESCRIBE KEYSPACE
 MY_KEYSPACE from CQL.
 This will give you an idea of how many replicas there are for these
 keyspaces.



 On Wed, May 28, 2014 at 11:49 AM, chovatia jaydeep 
 chovatia_jayd...@yahoo.co.in wrote:

 What is your partitioner type? Is
 it org.apache.cassandra.dht.Murmur3Partitioner?
 In your repair command I see two different keyspaces, MY_KEYSPACE
 and SN_KEYSPACE. Are these two separate keyspaces, or is that a typo?

 -jaydeep


   On Tuesday, 27 May 2014 10:26 PM, Matthew Allen 
 matthew.j.al...@gmail.com wrote:


 Hi,

 I am a bit confused regarding data ownership in a multi-DC environment.

 I have the following setup in a test cluster with a keyspace with
 (placement_strategy = 'NetworkTopologyStrategy' and strategy_options =
 {'DC_NSW':2,'DC_VIC':2};)

 Datacenter: DC_NSW
 ==
 Replicas: 2
 Address  Rack   Status  State   Load        Owns      Token
                                                       0
 nsw1     rack1  Up      Normal  1007.43 MB  100.00%   -9223372036854775808
 nsw2     rack1  Up      Normal  1008.08 MB  100.00%   0

 Datacenter: DC_VIC
 ==
 Replicas: 2
 Address  Rack   Status  State   Load        Owns      Token
                                                       100
 vic1     rack1  Up      Normal  1015.1 MB   100.00%

Re: efficiently generate complete database dump in text format

2014-10-09 Thread Paulo Ricardo Motta Gomes
The best way to generate dumps from Cassandra is via the Hadoop integration (or
Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar gbhatna...@gmail.com
wrote:




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br*
+55 48 3232.3200


Re: efficiently generate complete database dump in text format

2014-10-09 Thread Daniel Chia
You might also want to consider a tool like
https://github.com/Netflix/aegisthus for the last step, which can help you
deal with tombstones and de-duplicate data.

Thanks,
Daniel

On Thu, Oct 9, 2014 at 12:19 AM, Gaurav Bhatnagar gbhatna...@gmail.com
wrote:




Disabling compaction

2014-10-09 Thread Parag Shah
Hi all,

 I am trying to disable compaction for a few select tables. Here is a 
definition of one such table:

CREATE TABLE blob_2014_12_31 (
  blob_id uuid,
  blob_index int,
  blob_chunk blob,
  PRIMARY KEY (blob_id, blob_index)
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'enabled': 'false', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

I have set compaction 'enabled': 'false' on the above table.

However, I do see compactions being run for this node:

-bash-3.2$ nodetool compactionstats
pending tasks: 55
   compaction type  keyspace         table            completed    total        unit   progress
   Compaction       ids_high_awslab  blob_2014_11_15  18122816990  35814893020  bytes  50.60%
   Compaction       ids_high_awslab  blob_2014_12_31  18576750966  34242866468  bytes  54.25%
   Compaction       ids_high_awslab  blob_2014_12_15  19213914904  35956698600  bytes  53.44%
Active compaction remaining time :   0h49m46s

Can someone tell me why this is happening? Do I need to set the compaction 
thresholds to 0 0?
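Besides the schema-level flag, there is also a node-level switch worth trying. A dry-run sketch (keyspace/table names are the ones from the post; drop the echo to execute, and check nodetool help in your release to confirm each subcommand is available):

```shell
#!/bin/sh
# Dry-run sketch: stop automatic compaction at the node level, in
# addition to the schema-level 'enabled: false' setting.
stop_compaction() {
  echo nodetool disableautocompaction ids_high_awslab blob_2014_12_31
  # setting min/max thresholds to 0 0 is the older route to the same end:
  echo nodetool setcompactionthreshold ids_high_awslab blob_2014_12_31 0 0
  # compactions already in flight can be stopped outright:
  echo nodetool stop COMPACTION
}
stop_compaction
```

Note these node-level settings are not persistent across restarts, so they complement rather than replace the schema setting.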

Regards
Parag


Deleting counters

2014-10-09 Thread Robert Wille
At the Cassandra Summit I became aware that there are issues with deleting 
counters. I have a few questions about that. What is the bad thing that happens 
(or can possibly happen) when a counter is deleted? Is it safe to delete an 
entire row of counters? Is there any 2.0.x version of Cassandra in which it is 
safe to delete counters? Is there an access pattern in which it is safe to 
delete counters in 2.0.x?

Thanks

Robert



Re: Deleting counters

2014-10-09 Thread Robert Coli
On Thu, Oct 9, 2014 at 3:29 PM, Robert Wille rwi...@fold3.com wrote:

 What is the bad thing that happens (or can possibly happen) when a counter
 is deleted?


You get totally wacky counter state if you later re-create and use it.

https://issues.apache.org/jira/browse/CASSANDRA-6532
and
https://issues.apache.org/jira/browse/CASSANDRA-7346

have some good discussion.

Is it safe to delete an entire row of counters?


Not unless:
a) you will never use that particular counter row again
OR
b) gc_grace_seconds has passed and you have repaired and run a major
compaction on every node, such that 100% of such tombstones have been
removed
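Option (b) above, spelled out as commands (a dry-run sketch; hostnames and keyspace/table names are placeholders, these must run on every node, and only after gc_grace_seconds has elapsed):

```shell
#!/bin/sh
# Dry-run sketch of option (b): after gc_grace_seconds, repair and then
# major-compact *every* node so the counter tombstones are really gone.
purge_counter_tombstones() {
  for h in node1 node2 node3; do           # placeholder hostnames
    echo nodetool -h "$h" repair my_keyspace counters_table
    echo nodetool -h "$h" compact my_keyspace counters_table
  done
}
purge_counter_tombstones
```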


 Is there any 2.0.x version of Cassandra in which it is safe to delete
 counters?


See above; 2.0 behaves the same as any other version here.

Is there an access pattern in which it is safe to delete counters in 2.0.x?


The summary of the above is that, in practice, it is only safe to delete
counters if you plan to never use that particular counter row again.

=Rob