efficiently generate complete database dump in text format
Hi,

We have a Cassandra column family containing 320 million rows, and each row contains about 15 columns. We want to take a monthly dump of this single column family in text format. We are planning the following approach:

1. Take a snapshot of the Cassandra database using the nodetool utility, passing the -cf flag so that the snapshot contains data for only this column family.
2. Back up this snapshot and move the backup to a separate physical machine.
3. Use the SSTable-to-JSON conversion utility to convert all the data files into JSON format.

We have the following questions/doubts about this approach:

a) Generated JSON records can contain the "d" (IS_MARKED_FOR_DELETE) flag; can I safely ignore all such records?
b) If I ignore all records marked with the "d" flag, can the JSON files generated in step 3 still contain duplicate records, i.e. multiple entries for the same key?

Is there any better approach to generating data dumps in text format?

Regards,
Gaurav
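Questions (a) and (b) can be illustrated with a small merge pass over the converted files. This is only a hedged sketch, not the exact sstable2json schema: it assumes each dump file is a JSON list of rows shaped like `{"key": ..., "columns": [[name, value, timestamp, optional "d" marker], ...]}`, and `merge_dumps` is a hypothetical helper, not part of any Cassandra tooling.

```python
import json

def merge_dumps(paths_or_docs):
    """Merge several SSTable JSON dumps: for each (row key, column name),
    keep only the version with the highest timestamp, then drop entries
    whose newest version is a deletion ("d") marker.

    Accepts file paths or already-parsed JSON lists, for testing."""
    latest = {}  # (row_key, column_name) -> (timestamp, value, is_tombstone)
    for doc in paths_or_docs:
        if isinstance(doc, list):
            rows = doc
        else:
            with open(doc) as f:
                rows = json.load(f)
        for row in rows:
            for col in row["columns"]:
                name, value, ts = col[0], col[1], col[2]
                tomb = len(col) > 3 and col[3] == "d"
                k = (row["key"], name)
                # Last write wins, exactly as Cassandra reconciles versions.
                if k not in latest or ts > latest[k][0]:
                    latest[k] = (ts, value, tomb)
    # A tombstone newer than all live versions deletes the column entirely.
    return {k: v[1] for k, v in latest.items() if not v[2]}
```

The point of the sketch: you cannot simply discard records carrying the "d" flag file-by-file, because the same key can appear in several SSTables; you first reconcile by timestamp across all files, and only then drop keys whose winning version is the tombstone.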
RE: Doubts with the values of the parameter broadcast_rpc_address
Thank you Tyler! It's true that the cluster works. However, I'm going to follow up on the nodetool error, because it interests us.

Thank you!
Ricard

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Wednesday, October 8, 2014 17:25
To: user@cassandra.apache.org
Subject: Re: Doubts with the values of the parameter broadcast_rpc_address

On Wed, Oct 8, 2014 at 5:20 AM, Ricard Mestre Subirats <ricard.mestre.subir...@everis.com> wrote:

At the machine with IP 192.168.150.112:
- cluster_name: 'CassandraCluster1'
- seeds: 192.168.150.112
- listen_address: 192.168.150.112
- rpc_address: 0.0.0.0
- broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
- cluster_name: 'CassandraCluster1'
- seeds: 192.168.150.112
- listen_address: 192.168.150.113
- rpc_address: 0.0.0.0
- broadcast_rpc_address: 192.168.150.113

This is the correct configuration. Then, if we start the service and execute "nodetool status", the result is the following:

nodetool: Failed to connect to '127.0.0.1:7199' - NoRouteToHostException: 'There is not any route to the host'.

Nodetool does not (generally) use rpc_address/broadcast_rpc_address, because it's not using the normal API; it's using JMX. This is a different problem. If you want to check rpc_address/broadcast_rpc_address, use cqlsh (and pass an address).

You can specify a hostname for nodetool with the -h option: nodetool -h 192.168.150.112 status. Depending on your setup, you may also need to edit the line in conf/cassandra-env.sh that sets this option: -Djava.rmi.server.hostname=<public name>

--
Tyler Hobbs
DataStax (http://datastax.com/)
Connection error in nodetool
Hi everyone,

I have an issue with the nodetool connection. I'm configuring a 2-node cluster with Cassandra 2.1 and Java 8, and I'm having problems with nodetool. The configuration of the machines is the following:

At the machine with IP 192.168.150.112:
- cluster_name: 'CassandraCluster1'
- seeds: 192.168.150.112
- listen_address: 192.168.150.112
- rpc_address: 0.0.0.0
- broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
- cluster_name: 'CassandraCluster1'
- seeds: 192.168.150.112
- listen_address: 192.168.150.113
- rpc_address: 0.0.0.0
- broadcast_rpc_address: 192.168.150.113

When I execute nodetool status, the result is the following error:

nodetool: Failed to connect to '127.0.0.1:7199' - NoRouteToHostException: 'There is not any route to the host'.

I have read answers for similar errors suggesting that you can solve this by configuring the following parameter in cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=cassandra2"

I have configured this parameter on both machines, but the error persists. Can someone help me?

Thank you!
Ricard
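Since nodetool talks JMX (port 7199 by default) rather than the RPC interface, a NoRouteToHostException usually means a plain network-layer failure before JMX is even involved. A minimal sketch for checking raw TCP reachability of the JMX port from the machine running nodetool; `jmx_reachable` is an illustrative helper, not part of any Cassandra tooling:

```python
import socket

def jmx_reachable(host, port=7199, timeout=2.0):
    """Attempt a bare TCP connect to the JMX port. If this fails with a
    routing or refusal error, nodetool will fail the same way regardless
    of rpc_address/broadcast_rpc_address settings."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False for the node's own IP, the problem is firewalling or routing (or JMX bound to the wrong interface via java.rmi.server.hostname), not the Cassandra YAML configuration.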
Re: Multi-DC Repairs and Token Questions
Ok, got it. Thanks.

2014-10-07 14:56 GMT+02:00 Paulo Ricardo Motta Gomes <paulo.mo...@chaordicsystems.com>:

This related issue might be of interest: https://issues.apache.org/jira/browse/CASSANDRA-7450

In 1.2 the -pr option does perform cross-DC repairs, but you must ensure that all nodes from all datacenters execute repair, otherwise some ranges will be missed. This fix enables -pr and -local together, which was disabled in 2.0 because it didn't work (it also does not work in 1.2).

On Tue, Oct 7, 2014 at 5:46 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

Hi guys, sorry about digging this up, but is this bug also affecting 1.2.x versions? I can't see it being backported to 1.2 in the Jira. Was this bug introduced in 2.0?

Anyway, how does nodetool repair -pr behave in a multi-DC environment: does it make cross-DC repairs or not? Should we drop the -pr option in a multi-DC context to remove entropy between DCs? I mean, repair -pr is supposed to repair the primary range for the current node; does it also repair the corresponding primary range in other DCs?

Thanks for any insight on this.

2014-06-03 8:06 GMT+02:00 Nick Bailey <n...@datastax.com>:

See https://issues.apache.org/jira/browse/CASSANDRA-7317

On Mon, Jun 2, 2014 at 8:57 PM, Matthew Allen <matthew.j.al...@gmail.com> wrote:

Hi Rameez, Chovatia (sorry, I initially replied to Dwight individually),

SN_KEYSPACE and MY_KEYSPACE are just typos (I was trying to mask out identifiable information); they are the same keyspace.
Keyspace: SN_KEYSPACE
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [DC_VIC:2, DC_NSW:2]

In a nutshell, replication is working as expected; I'm just confused about token range assignments in a multi-DC environment and how repairs should work.

http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html specifies: "Multiple data center deployments: calculate the tokens for each data center so that the hash range is evenly divided for the nodes in each data center."

Given that nodetool repair isn't multi-DC aware, in our production 18-node cluster (9 nodes in each DC), which of the following token assignments should be used (Murmur3Partitioner)?

Token range divided evenly over the 2 DCs / 18 nodes, as below?

Node  DC_NSW                  DC_VIC
1     -9223372036854775808    -8198552921648689608
2     -7173733806442603408    -6148914691236517208
3     -5124095576030431008    -4099276460824344808
4     -3074457345618258608    -2049638230412172408
5     -1024819115206086208    -8
6     1024819115206086192     2049638230412172392
7     3074457345618258592     4099276460824344792
8     5124095576030430992     6148914691236517192
9     7173733806442603392     8198552921648689592

Or an offset used for DC_VIC (i.e. DC_NSW + 100)?

Node  DC_NSW                  DC_VIC
1     -9223372036854775808    -9223372036854775708
2     -7173733806442603407    -7173733806442603307
3     -5124095576030431006    -5124095576030430906
4     -3074457345618258605    -3074457345618258505
5     -1024819115206086204    -1024819115206086104
6     1024819115206086197     1024819115206086297
7     3074457345618258598     3074457345618258698
8     5124095576030430999     5124095576030431099
9     7173733806442603400     7173733806442603500

It's too late for me to switch to vnodes. Hope that makes sense, thanks.

Matt

On Thu, May 29, 2014 at 12:01 AM, Rameez Thonnakkal <ssram...@gmail.com> wrote:

As Chovatia mentioned, the keyspaces seem to be different.
Try DESCRIBE KEYSPACE SN_KEYSPACE and DESCRIBE KEYSPACE MY_KEYSPACE from CQL. This will give you an idea of how many replicas there are for these keyspaces.

On Wed, May 28, 2014 at 11:49 AM, chovatia jaydeep <chovatia_jayd...@yahoo.co.in> wrote:

What is your partitioner type? Is it org.apache.cassandra.dht.Murmur3Partitioner? In your repair command I do see two different keyspaces, MY_KEYSPACE and SN_KEYSPACE; are these two separate keyspaces or a typo?

-jaydeep

On Tuesday, 27 May 2014 10:26 PM, Matthew Allen <matthew.j.al...@gmail.com> wrote:

Hi, I am a bit confused regarding data ownership in a multi-DC environment. I have the following setup in a test cluster, with a keyspace using (placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {'DC_NSW':2,'DC_VIC':2};):

Datacenter: DC_NSW
==================
Replicas: 2
Address  Rack   Status  State   Load        Owns     Token
                                                     0
nsw1     rack1  Up      Normal  1007.43 MB  100.00%  -9223372036854775808
nsw2     rack1  Up      Normal  1008.08 MB  100.00%  0

Datacenter: DC_VIC
==================
Replicas: 2
Address  Rack   Status  State   Load        Owns     Token
                                                     100
vic1     rack1  Up      Normal  1015.1 MB   100.00%
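Both token layouts discussed above can be reproduced with a short calculation over the Murmur3 ring, which spans [-2**63, 2**63 - 1]. This is an illustrative sketch (the `initial_tokens` helper is hypothetical), assuming evenly spaced tokens computed with floor division, which matches the values in the tables:

```python
RING = 2**64  # total width of the Murmur3Partitioner token space

def initial_tokens(node_count, offset=0):
    """Evenly spaced initial tokens starting at -2**63, shifted by a
    small per-DC `offset` (the 'DC_NSW + 100' scheme above)."""
    step = RING // node_count
    return [-2**63 + offset + i * step for i in range(node_count)]

# Option 2: each DC gets its own evenly divided 9-node ring, VIC offset by 100.
nsw = initial_tokens(9)
vic = initial_tokens(9, offset=100)

# Option 1: one 18-node division, alternating nodes between the two DCs.
combined = initial_tokens(18)
nsw_alt, vic_alt = combined[0::2], combined[1::2]
```

Running this reproduces both tables: `nsw[1]` gives -7173733806442603407 (the offset scheme) while `combined[1]` gives -8198552921648689608 (the interleaved scheme), so the two options differ only in how the same even spacing is split across datacenters.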
Re: efficiently generate complete database dump in text format
The best way to generate dumps from Cassandra is via the Hadoop integration (or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

--
Paulo Motta
Chaordic | Platform
www.chaordic.com.br (http://www.chaordic.com.br/)
+55 48 3232.3200
Re: efficiently generate complete database dump in text format
You might also want to consider tools like https://github.com/Netflix/aegisthus for the last step, which can help you deal with tombstones and de-duplicate data.

Thanks,
Daniel
Disabling compaction
Hi all,

I am trying to disable compaction for a few select tables. Here is the definition of one such table:

CREATE TABLE blob_2014_12_31 (
  blob_id uuid,
  blob_index int,
  blob_chunk blob,
  PRIMARY KEY (blob_id, blob_index)
) WITH bloom_filter_fp_chance=0.01
  AND caching='KEYS_ONLY'
  AND comment=''
  AND dclocal_read_repair_chance=0.00
  AND gc_grace_seconds=864000
  AND index_interval=128
  AND read_repair_chance=0.10
  AND replicate_on_write='true'
  AND populate_io_cache_on_flush='false'
  AND default_time_to_live=0
  AND speculative_retry='99.0PERCENTILE'
  AND memtable_flush_period_in_ms=0
  AND compaction={'enabled': 'false', 'class': 'SizeTieredCompactionStrategy'}
  AND compression={'sstable_compression': 'LZ4Compressor'};

I have set compaction 'enabled': 'false' on the above table. However, I still see compactions running on this node:

-bash-3.2$ nodetool compactionstats
pending tasks: 55
  compaction type  keyspace         table            completed    total        unit   progress
  Compaction       ids_high_awslab  blob_2014_11_15  18122816990  35814893020  bytes  50.60%
  Compaction       ids_high_awslab  blob_2014_12_31  18576750966  34242866468  bytes  54.25%
  Compaction       ids_high_awslab  blob_2014_12_15  19213914904  35956698600  bytes  53.44%
Active compaction remaining time: 0h49m46s

Can someone tell me why this is happening? Do I need to set the compaction thresholds to 0 0?

Regards,
Parag
Deleting counters
At the Cassandra Summit I became aware that there are issues with deleting counters. I have a few questions about that.

What is the bad thing that happens (or can possibly happen) when a counter is deleted? Is it safe to delete an entire row of counters? Is there any 2.0.x version of Cassandra in which it is safe to delete counters? Is there an access pattern in which it is safe to delete counters in 2.0.x?

Thanks,
Robert
Re: Deleting counters
On Thu, Oct 9, 2014 at 3:29 PM, Robert Wille <rwi...@fold3.com> wrote:

> What is the bad thing that happens (or can possibly happen) when a counter is deleted?

You get totally wacky counter state if you later re-create and use it. https://issues.apache.org/jira/browse/CASSANDRA-6532 and https://issues.apache.org/jira/browse/CASSANDRA-7346 have some good discussion.

> Is it safe to delete an entire row of counters?

Not unless:

a) you will never use that particular counter row again, OR
b) gc_grace_seconds has passed and you have repaired and run a major compaction on every node, such that 100% of such tombstones have been removed.

> Is there any 2.0.x version of Cassandra in which it is safe to delete counters?

See above; 2.0 is the same as any other version for this behavior.

> Is there an access pattern in which it is safe to delete counters in 2.0.x?

The summary of the above is that, in practice, it is only safe to delete counters if you plan to never use that particular counter row again.

=Rob