efficiently generate complete database dump in text format

2014-10-09 Thread Gaurav Bhatnagar
Hi,
   We have a Cassandra column family containing 320 million rows, and each
row contains about 15 columns. We want to take a monthly dump of this
single column family in text format.

We are planning to take the following approach to implement this functionality:
1. Take a snapshot of the Cassandra database using the nodetool utility,
   passing the -cf flag with the column family name so that the snapshot
   contains data for that single column family.
2. Back up this snapshot and move the backup to a separate physical machine.
3. Use the "SSTable to JSON conversion" utility (sstable2json) to convert
   all the data files into JSON format.
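
A minimal sketch of these three steps (keyspace, column family, paths, and
the snapshot tag are all hypothetical; sstable2json ships with Cassandra):

# 1. Snapshot only the target column family on each node
nodetool snapshot -t monthly_dump -cf my_cf my_keyspace

# 2. Move the snapshot off-node (data path layout varies by version)
rsync -a /var/lib/cassandra/data/my_keyspace/my_cf/snapshots/monthly_dump/ \
    backup-host:/dumps/my_cf/

# 3. Convert every SSTable to JSON on the backup machine
for f in /dumps/my_cf/*-Data.db; do
    sstable2json "$f" > "${f%.db}.json"
done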

We have the following questions/doubts regarding the above approach:
a) Some generated JSON records contain the "d" (IS_MARKED_FOR_DELETE) flag.
   Can I safely ignore all such records?
b) If I ignore all records marked with the "d" flag, can the JSON files
   generated in step 3 still contain duplicate records, i.e. multiple
   entries for the same key?

Is there any better approach to generate data dumps in text format?

Regards,
Gaurav


RE: Doubts with the values of the parameter broadcast_rpc_address

2014-10-09 Thread Ricard Mestre Subirats
Thank you Tyler! It's true that the cluster works!
However, I'm still going to ask about the nodetool error, because it interests us.

Thank you!

Ricard

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Wednesday, 8 October 2014 17:25
To: user@cassandra.apache.org
Subject: Re: Doubts with the values of the parameter broadcast_rpc_address


On Wed, Oct 8, 2014 at 5:20 AM, Ricard Mestre Subirats
<ricard.mestre.subir...@everis.com> wrote:
At the machine with IP 192.168.150.112:
-cluster_name: 'CassandraCluster1'
-seeds: "192.168.150.112"
-listen_address: 192.168.150.112
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
-cluster_name: 'CassandraCluster1'
-seeds: "192.168.150.112"
-listen_address: 192.168.150.113
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.113

This is the correct configuration.


Then, if we start the service and execute “nodetool status” the result is the 
following:
nodetool: Failed to connect to '127.0.0.1:7199' -
NoRouteToHostException: 'No route to host'.

Nodetool does not (generally) use rpc_address/broadcast_rpc_address, because
it does not use the normal client API; it uses JMX.  This is a different
problem.  If you want to check rpc_address/broadcast_rpc_address, use cqlsh
(and pass an address).

You can specify a hostname for nodetool with the -h option: nodetool -h 
192.168.150.112 status.  Depending on your setup, you may also need to edit the 
line in conf/cassandra-env.sh that sets this option:  
-Djava.rmi.server.hostname=
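
A hedged sketch of both suggestions, using the IP from the configuration
quoted above:

# In conf/cassandra-env.sh on each node, advertise a reachable address
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=192.168.150.112"

# Then point nodetool at that address explicitly
nodetool -h 192.168.150.112 status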


--
Tyler Hobbs
DataStax





Error connection in Nodetool

2014-10-09 Thread Ricard Mestre Subirats
Hi everyone,

I have an issue with the nodetool connection. I'm configuring a 2-node cluster
with Cassandra 2.1 and Java 8, and I'm having problems with nodetool. The
configuration of the machines is the following:

At the machine with IP 192.168.150.112:
-cluster_name: 'CassandraCluster1'
-seeds: "192.168.150.112"
-listen_address: 192.168.150.112
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.112

At the machine with IP 192.168.150.113:
-cluster_name: 'CassandraCluster1'
-seeds: "192.168.150.112"
-listen_address: 192.168.150.113
-rpc_address: 0.0.0.0
-broadcast_rpc_address: 192.168.150.113
When I execute "nodetool status" the result is the following error:
nodetool: Failed to connect to '127.0.0.1:7199' - NoRouteToHostException:
'No route to host'.

I have read answers to similar errors suggesting that you can solve this by
configuring the following parameter in cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=cassandra2"

I have configured this parameter in both machines but the error persists.
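
Since the JMX hostname is already set, a NoRouteToHostException usually
points at the network or a firewall rather than Cassandra itself; a couple
of hedged checks (IP taken from the configuration above):

# Is anything listening on the JMX port on the target node?
netstat -an | grep 7199

# Is the port reachable from the machine running nodetool?
nc -vz 192.168.150.112 7199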

Can someone help me?
Thank you!

Ricard





Re: Multi-DC Repairs and Token Questions

2014-10-09 Thread Alain RODRIGUEZ
Ok got it.

Thanks.

2014-10-07 14:56 GMT+02:00 Paulo Ricardo Motta Gomes <
paulo.mo...@chaordicsystems.com>:

> This related issue might be of interest:
> https://issues.apache.org/jira/browse/CASSANDRA-7450
>
> In 1.2 the "-pr" option does make cross-DC repairs, but you must ensure
> that all nodes in all datacenters execute repair, otherwise some ranges
> will be missed. This fix enables -pr and -local together, which was
> disabled in 2.0 because it didn't work (it also does not work in 1.2).
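
A hedged sketch of the resulting 1.2.x procedure (the keyspace name is
hypothetical):

# Must run on EVERY node in EVERY datacenter, or some ranges are missed
nodetool repair -pr my_keyspace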
>
> On Tue, Oct 7, 2014 at 5:46 AM, Alain RODRIGUEZ 
> wrote:
>
> >> Hi guys, sorry about digging this up, but is this bug also affecting
> >> 1.2.x versions? I can't see it being backported to 1.2 on the Jira. Was
> >> this bug introduced in 2.0?
> >>
> >> Anyway, how does nodetool repair -pr behave in a multi-DC environment:
> >> does it make cross-DC repairs or not? Should we drop the "-pr" option in
> >> a multi-DC context to reduce entropy between DCs? I mean, a repair -pr
> >> is supposed to repair the primary range of the current node; does it
> >> also repair the corresponding primary range in other DCs?
>>
>> Thanks for insight around this.
>>
>> 2014-06-03 8:06 GMT+02:00 Nick Bailey :
>>
>>> See https://issues.apache.org/jira/browse/CASSANDRA-7317
>>>
>>>
>>> On Mon, Jun 2, 2014 at 8:57 PM, Matthew Allen
>>> <matthew.j.al...@gmail.com> wrote:
>>>
 Hi Rameez, Chovatia, (sorry I initially replied to Dwight individually)

 SN_KEYSPACE and MY_KEYSPACE are just typos (I was trying to mask out
 identifiable information); they are the same keyspace.

 Keyspace: SN_KEYSPACE:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [DC_VIC:2, DC_NSW:2]

 In a nutshell, replication is working as expected; I'm just confused
 about token range assignments in a multi-DC environment and how repairs
 should work.

 From
 http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html,
 it specifies

 *"Multiple data center deployments: calculate the tokens for
 each data center so that the hash range is evenly divided for the nodes in
 each data center"*

 Given that nodetool repair isn't multi-DC aware, in our production
 18-node cluster (9 nodes in each DC), which of the following token ranges
 should be used (Murmur3 partitioner)? (A token-generation sketch follows
 the second table.)

 Token range divided evenly over the 2 DCs/18 nodes, as below?

 Node  DC_NSW                  DC_VIC
 1     '-9223372036854775808'  '-8198552921648689608'
 2     '-7173733806442603408'  '-6148914691236517208'
 3     '-5124095576030431008'  '-4099276460824344808'
 4     '-3074457345618258608'  '-2049638230412172408'
 5     '-1024819115206086208'  '-8'
 6     '1024819115206086192'   '2049638230412172392'
 7     '3074457345618258592'   '4099276460824344792'
 8     '5124095576030430992'   '6148914691236517192'
 9     '7173733806442603392'   '8198552921648689592'

 Or an offset used for DC_VIC (i.e. the DC_NSW token + 100)?

 Node  DC_NSW                  DC_VIC
 1     '-9223372036854775808'  '-9223372036854775708'
 2     '-7173733806442603407'  '-7173733806442603307'
 3     '-5124095576030431006'  '-5124095576030430906'
 4     '-3074457345618258605'  '-3074457345618258505'
 5     '-1024819115206086204'  '-1024819115206086104'
 6     '1024819115206086197'   '1024819115206086297'
 7     '3074457345618258598'   '3074457345618258698'
 8     '5124095576030430999'   '5124095576030431099'
 9     '7173733806442603400'   '7173733806442603500'
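
A hedged sketch of how the offset variant can be generated (it reproduces
the second table above exactly; the resulting values would still have to be
assigned per node as initial_token):

# Evenly divide the Murmur3 range among 9 nodes per DC,
# offsetting the second DC by +100
python - <<'EOF'
num_nodes = 9
for offset, dc in ((0, "DC_NSW"), (100, "DC_VIC")):
    for i in range(num_nodes):
        token = (2**64 // num_nodes) * i - 2**63 + offset
        print("%s node %d: %d" % (dc, i + 1, token))
EOF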

 It's too late for me to switch to vnodes. Hope that makes sense, thanks.

 Matt



 On Thu, May 29, 2014 at 12:01 AM, Rameez Thonnakkal wrote:

> As Chovatia mentioned, the keyspaces seem to be different.
> Try "DESCRIBE KEYSPACE SN_KEYSPACE" and "DESCRIBE KEYSPACE MY_KEYSPACE"
> from CQL.
> This will give you an idea of how many replicas there are for these
> keyspaces.
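
A hedged example of the check from inside cqlsh (the replication factors
appear in the strategy options of the output):

cqlsh> DESCRIBE KEYSPACE SN_KEYSPACE;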
>
>
>
> On Wed, May 28, 2014 at 11:49 AM, chovatia jaydeep <
> chovatia_jayd...@yahoo.co.in> wrote:
>
>> What is your partitioner type? Is it
>> org.apache.cassandra.dht.Murmur3Partitioner?
>> In your repair command I see two different keyspaces, "MY_KEYSPACE"
>> and "SN_KEYSPACE"; are these two separate keyspaces, or is that a typo?
>>
>> -jaydeep
>>
>>
>>   On Tuesday, 27 May 2014 10:26 PM, Matthew Allen <
>> matthew.j.al...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> I'm a bit confused regarding data ownership in a multi-DC environment.
>>
>> I have the following setup in a test cluster with a keyspace with
>> (placement_strategy = 'NetworkTopologyStrategy' and strategy_options =
>> {'DC_NSW':2,'DC_VIC':2};)
>>
>> Datacenter: DC_NSW
>> ==

Re: efficiently generate complete database dump in text format

2014-10-09 Thread Paulo Ricardo Motta Gomes
The best way to generate dumps from Cassandra is via its Hadoop integration
(or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar wrote:

> [snip]



--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200


Re: efficiently generate complete database dump in text format

2014-10-09 Thread Daniel Chia
You might also want to consider tools like
https://github.com/Netflix/aegisthus for the last step, which can help you
deal with tombstones and de-duplicate data.

Thanks,
Daniel

On Thu, Oct 9, 2014 at 12:19 AM, Gaurav Bhatnagar wrote:

> [snip]


Disabling compaction

2014-10-09 Thread Parag Shah
Hi all,

 I am trying to disable compaction for a few select tables. Here is the
definition of one such table:

CREATE TABLE blob_2014_12_31 (
  blob_id uuid,
  blob_index int,
  blob_chunk blob,
  PRIMARY KEY (blob_id, blob_index)
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'enabled': 'false', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

I have set compaction ‘enabled’ : ‘false’ on the above table.

However, I do see compactions being run for this node:

-bash-3.2$ nodetool compactionstats
pending tasks: 55
   compaction type   keyspace          table             completed     total         unit   progress
        Compaction   ids_high_awslab   blob_2014_11_15   18122816990   35814893020   bytes  50.60%
        Compaction   ids_high_awslab   blob_2014_12_31   18576750966   34242866468   bytes  54.25%
        Compaction   ids_high_awslab   blob_2014_12_15   19213914904   35956698600   bytes  53.44%
Active compaction remaining time :   0h49m46s

Can someone tell me why this is happening? Do I need to set the compaction
thresholds to 0 0?

Regards
Parag


Deleting counters

2014-10-09 Thread Robert Wille
At the Cassandra Summit I became aware that there are issues with deleting
counters. I have a few questions about that. What is the bad thing that
happens (or can possibly happen) when a counter is deleted? Is it safe to
delete an entire row of counters? Is there any 2.0.x version of Cassandra in
which it is safe to delete counters? Is there an access pattern in which it
is safe to delete counters in 2.0.x?

Thanks

Robert



Re: Deleting counters

2014-10-09 Thread Robert Coli
On Thu, Oct 9, 2014 at 3:29 PM, Robert Wille  wrote:

> What is the bad thing that happens (or can possibly happen) when a counter
> is deleted?
>

You can get totally wacky counter state if you later re-create and use it.

https://issues.apache.org/jira/browse/CASSANDRA-6532
and
https://issues.apache.org/jira/browse/CASSANDRA-7346

have some good discussion.

> Is it safe to delete an entire row of counters?


Not unless:
a) you will never use that particular counter row again
OR
b) gc_grace_seconds has passed and you have repaired and run a major
compaction on every node, such that 100% of such tombstones have been
removed
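
A hedged sketch of what option (b) involves operationally (keyspace and
table names are hypothetical; run on every node, and only after
gc_grace_seconds has elapsed since the deletes):

# repair first, so every replica has seen the tombstones
nodetool repair my_keyspace my_counters

# then force a major compaction to purge the expired tombstones
nodetool compact my_keyspace my_counters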


> Is there any 2.0.x version of Cassandra in which it is safe to delete
> counters?


See above, 2.0 is the same as any other version for this behavior.

> Is there an access pattern in which it is safe to delete counters in
> 2.0.x?

The summary of the above is that, in practice, it is only safe to delete
counters if you plan to never use that particular counter row again.

=Rob


Re: Disabling compaction

2014-10-09 Thread Marcus Eriksson
what version are you on?
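
Whatever the version, a hedged runtime alternative that does not depend on
the schema option (keyspace/table taken from the question; assumes a
nodetool recent enough to have disableautocompaction):

# stop future automatic compactions for this table
nodetool disableautocompaction ids_high_awslab blob_2014_12_31

# optionally halt any compaction already in progress
nodetool stop COMPACTION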

On Thu, Oct 9, 2014 at 10:33 PM, Parag Shah  wrote:

> [snip]