Re: Strange issue wherein cassandra not being started from cron
Hi Hannu. On Wed, Jan 11, 2017 at 8:31 PM, Hannu Kröger <hkro...@gmail.com> wrote: > One possible reason is that cassandra process gets different user when run > differently. Check who owns the data files and check also what gets written > into the /var/log/cassandra/system.log (or whatever that was). > Absolutely nothing gets written to /var/log/cassandra/system.log (when trying to invoke cassandra via cron). > > Hannu > > > On 11 Jan 2017, at 16.42, Ajay Garg <ajaygargn...@gmail.com> wrote: > > Tried everything. > Every other cron job/script I try works, just the cassandra-service does > not. > > On Wed, Jan 11, 2017 at 8:51 AM, Edward Capriolo <edlinuxg...@gmail.com> > wrote: > >> >> >> On Tuesday, January 10, 2017, Jonathan Haddad <j...@jonhaddad.com> wrote: >> >>> Last I checked, cron doesn't load the same, full environment you see >>> when you log in. Also, why put Cassandra on a cron? >>> On Mon, Jan 9, 2017 at 9:47 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote: >>> >>>> Hi Ajay, >>>> >>>> Have you had a look at cron logs? - mine is in path /var/log/cron >>>> >>>> Thanks & Regards, >>>> >>>> On Tue, Jan 10, 2017 at 9:45 AM, Ajay Garg <ajaygargn...@gmail.com> >>>> wrote: >>>> >>>>> Hi All. >>>>> >>>>> Facing a very weird issue, wherein the command >>>>> >>>>> */etc/init.d/cassandra start* >>>>> >>>>> causes cassandra to start when the command is run from command-line. >>>>> >>>>> >>>>> However, if I put the above as a cron job >>>>> >>>>> >>>>> >>>>> ** * * * * /etc/init.d/cassandra start* >>>>> cassandra never starts. >>>>> >>>>> >>>>> I have checked, and "cron" service is running. >>>>> >>>>> >>>>> Any ideas what might be wrong? >>>>> I am pasting the cassandra script for brevity. >>>>> >>>>> >>>>> Thanks and Regards, >>>>> Ajay >>>>> >>>>> >>>>> >>>>> >>>>> #! 
/bin/sh >>>>> ### BEGIN INIT INFO >>>>> # Provides: cassandra >>>>> # Required-Start:$remote_fs $network $named $time >>>>> # Required-Stop: $remote_fs $network $named $time >>>>> # Should-Start: ntp mdadm >>>>> # Should-Stop: ntp mdadm >>>>> # Default-Start: 2 3 4 5 >>>>> # Default-Stop: 0 1 6 >>>>> # Short-Description: distributed storage system for structured data >>>>> # Description: Cassandra is a distributed (peer-to-peer) system >>>>> for >>>>> #the management and storage of structured data. >>>>> ### END INIT INFO >>>>> >>>>> # Author: Eric Evans <eev...@racklabs.com> >>>>> >>>>> DESC="Cassandra" >>>>> NAME=cassandra >>>>> PIDFILE=/var/run/$NAME/$NAME.pid >>>>> SCRIPTNAME=/etc/init.d/$NAME >>>>> CONFDIR=/etc/cassandra >>>>> WAIT_FOR_START=10 >>>>> CASSANDRA_HOME=/usr/share/cassandra >>>>> FD_LIMIT=10 >>>>> >>>>> [ -e /usr/share/cassandra/apache-cassandra.jar ] || exit 0 >>>>> [ -e /etc/cassandra/cassandra.yaml ] || exit 0 >>>>> [ -e /etc/cassandra/cassandra-env.sh ] || exit 0 >>>>> >>>>> # Read configuration variable file if it is present >>>>> [ -r /etc/default/$NAME ] && . /etc/default/$NAME >>>>> >>>>> # Read Cassandra environment file. >>>>> . /etc/cassandra/cassandra-env.sh >>>>> >>>>> if [ -z "$JVM_OPTS" ]; then >>>>> echo "Initialization failed; \$JVM_OPTS not set!" >&2 >>>>> exit 3 >>>>> fi >>>>> >>>>> export JVM_OPTS >>>>> >>>>> # Export JAVA_HOME, if set. >>>>> [ -n "$JAVA_HOME" ] && export JAVA_HOME >>>>> >>>>> # Load the VERBOSE setting and other rcS variables >>>>>
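The environment point quoted above (cron does not load a login shell's environment) is usually the culprit in this kind of failure, and it is easy to confirm. A debugging sketch; all file paths below are examples only, not anything from the thread:

```shell
# Cron runs jobs with a minimal environment (typically just HOME, LOGNAME,
# SHELL, and a short PATH such as /usr/bin:/bin), so an init script that
# works from a login shell can fail under cron, e.g. because "java" or a
# sourced helper is not on PATH.

# Temporary crontab entries to see what cron actually does:
#   * * * * * env | sort > /tmp/cron-env.txt
#   * * * * * /etc/init.d/cassandra start > /tmp/cassandra-cron.log 2>&1

# Afterwards, capture the login shell's environment for comparison:
env | sort > /tmp/login-env.txt
# ...and diff the two files:
#   diff /tmp/cron-env.txt /tmp/login-env.txt
```

The second crontab entry is worth doing in any case: cron mails or discards a job's stdout/stderr, which is why "absolutely nothing gets written to system.log" while the real error message goes unseen.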
Re: Strange issue wherein cassandra not being started from cron
On Wed, Jan 11, 2017 at 8:29 PM, Martin Schröder <mar...@oneiros.de> wrote:
> 2017-01-11 15:42 GMT+01:00 Ajay Garg <ajaygargn...@gmail.com>:
> > Tried everything.
>
> Then try
>     service cassandra start
> or
>     systemctl start cassandra
>
> You still haven't explained to us why you want to start cassandra every
> minute.

Hi Martin.

Sometimes the cassandra process gets killed (reason unknown as of now); a manual "service cassandra start" then brings it back. Adding the start command to cron would at least ensure that the maximum downtime is 59 seconds, until the root cause of the crashes is found.

> Best
>    Martin

--
Regards,
Ajay
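For the watchdog use-case described above, a small wrapper script invoked by cron is usually more robust than running "start" blindly every minute. A sketch only, not a tested recipe: the pidfile and log paths are assumptions matching the init script quoted elsewhere in this thread, and should be adjusted for the actual install.

```shell
#!/bin/sh
# Hypothetical cron watchdog: start Cassandra only when it is not already
# running, and log every attempt so silent failures become visible.
# PIDFILE mirrors the init script's /var/run/cassandra/cassandra.pid.

PIDFILE=${PIDFILE:-/var/run/cassandra/cassandra.pid}
LOG=${LOG:-/tmp/cassandra-watchdog.log}

# Running means: the pidfile exists and its pid answers signal 0.
is_running() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

if ! is_running; then
    echo "$(date): cassandra not running, attempting start" >> "$LOG"
    if [ -x /etc/init.d/cassandra ]; then
        /etc/init.d/cassandra start >> "$LOG" 2>&1 || true
    fi
fi
```

The crontab entry would then call this wrapper (e.g. `* * * * * /usr/local/bin/cassandra-watchdog.sh`), and the log shows both when restarts happened and what the init script printed when they failed.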
Re: Strange issue wherein cassandra not being started from cron
Tried everything. Every other cron job/script I try works, just the cassandra-service does not. On Wed, Jan 11, 2017 at 8:51 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote: > > > On Tuesday, January 10, 2017, Jonathan Haddad <j...@jonhaddad.com> wrote: > >> Last I checked, cron doesn't load the same, full environment you see when >> you log in. Also, why put Cassandra on a cron? >> On Mon, Jan 9, 2017 at 9:47 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote: >> >>> Hi Ajay, >>> >>> Have you had a look at cron logs? - mine is in path /var/log/cron >>> >>> Thanks & Regards, >>> >>> On Tue, Jan 10, 2017 at 9:45 AM, Ajay Garg <ajaygargn...@gmail.com> >>> wrote: >>> >>>> Hi All. >>>> >>>> Facing a very weird issue, wherein the command >>>> >>>> */etc/init.d/cassandra start* >>>> >>>> causes cassandra to start when the command is run from command-line. >>>> >>>> >>>> However, if I put the above as a cron job >>>> >>>> >>>> >>>> ** * * * * /etc/init.d/cassandra start* >>>> cassandra never starts. >>>> >>>> >>>> I have checked, and "cron" service is running. >>>> >>>> >>>> Any ideas what might be wrong? >>>> I am pasting the cassandra script for brevity. >>>> >>>> >>>> Thanks and Regards, >>>> Ajay >>>> >>>> >>>> >>>> >>>> #! /bin/sh >>>> ### BEGIN INIT INFO >>>> # Provides: cassandra >>>> # Required-Start:$remote_fs $network $named $time >>>> # Required-Stop: $remote_fs $network $named $time >>>> # Should-Start: ntp mdadm >>>> # Should-Stop: ntp mdadm >>>> # Default-Start: 2 3 4 5 >>>> # Default-Stop: 0 1 6 >>>> # Short-Description: distributed storage system for structured data >>>> # Description: Cassandra is a distributed (peer-to-peer) system >>>> for >>>> #the management and storage of structured data. 
>>>> ### END INIT INFO >>>> >>>> # Author: Eric Evans <eev...@racklabs.com> >>>> >>>> DESC="Cassandra" >>>> NAME=cassandra >>>> PIDFILE=/var/run/$NAME/$NAME.pid >>>> SCRIPTNAME=/etc/init.d/$NAME >>>> CONFDIR=/etc/cassandra >>>> WAIT_FOR_START=10 >>>> CASSANDRA_HOME=/usr/share/cassandra >>>> FD_LIMIT=10 >>>> >>>> [ -e /usr/share/cassandra/apache-cassandra.jar ] || exit 0 >>>> [ -e /etc/cassandra/cassandra.yaml ] || exit 0 >>>> [ -e /etc/cassandra/cassandra-env.sh ] || exit 0 >>>> >>>> # Read configuration variable file if it is present >>>> [ -r /etc/default/$NAME ] && . /etc/default/$NAME >>>> >>>> # Read Cassandra environment file. >>>> . /etc/cassandra/cassandra-env.sh >>>> >>>> if [ -z "$JVM_OPTS" ]; then >>>> echo "Initialization failed; \$JVM_OPTS not set!" >&2 >>>> exit 3 >>>> fi >>>> >>>> export JVM_OPTS >>>> >>>> # Export JAVA_HOME, if set. >>>> [ -n "$JAVA_HOME" ] && export JAVA_HOME >>>> >>>> # Load the VERBOSE setting and other rcS variables >>>> . /lib/init/vars.sh >>>> >>>> # Define LSB log_* functions. >>>> # Depend on lsb-base (>= 3.0-6) to ensure that this file is present. >>>> . /lib/lsb/init-functions >>>> >>>> # >>>> # Function that returns 0 if process is running, or nonzero if not. >>>> # >>>> # The nonzero value is 3 if the process is simply not running, and 1 if >>>> the >>>> # process is not running but the pidfile exists (to match the exit >>>> codes for >>>> # the "status" command; see LSB core spec 3.1, section 20.2) >>>> # >>>> CMD_PATT="cassandra.+CassandraDaemon" >>>> is_running() >>>> { >>>> if [ -f $PIDFILE ]; then >>>> pid=`cat $PIDFILE` >>>> grep -Eq "$CMD_PATT" &
Strange issue wherein cassandra not being started from cron
Hi All.

Facing a very weird issue, wherein the command

    /etc/init.d/cassandra start

causes cassandra to start when the command is run from the command-line.

However, if I put the above as a cron job

    * * * * * /etc/init.d/cassandra start

cassandra never starts.

I have checked, and the "cron" service is running.

Any ideas what might be wrong?
I am pasting the cassandra init script for reference.

Thanks and Regards,
Ajay


#!/bin/sh
### BEGIN INIT INFO
# Provides:          cassandra
# Required-Start:    $remote_fs $network $named $time
# Required-Stop:     $remote_fs $network $named $time
# Should-Start:      ntp mdadm
# Should-Stop:       ntp mdadm
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: distributed storage system for structured data
# Description:       Cassandra is a distributed (peer-to-peer) system for
#                    the management and storage of structured data.
### END INIT INFO

# Author: Eric Evans <eev...@racklabs.com>

DESC="Cassandra"
NAME=cassandra
PIDFILE=/var/run/$NAME/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
CONFDIR=/etc/cassandra
WAIT_FOR_START=10
CASSANDRA_HOME=/usr/share/cassandra
FD_LIMIT=10

[ -e /usr/share/cassandra/apache-cassandra.jar ] || exit 0
[ -e /etc/cassandra/cassandra.yaml ] || exit 0
[ -e /etc/cassandra/cassandra-env.sh ] || exit 0

# Read configuration variable file if it is present
[ -r /etc/default/$NAME ] && . /etc/default/$NAME

# Read Cassandra environment file.
. /etc/cassandra/cassandra-env.sh

if [ -z "$JVM_OPTS" ]; then
    echo "Initialization failed; \$JVM_OPTS not set!" >&2
    exit 3
fi

export JVM_OPTS

# Export JAVA_HOME, if set.
[ -n "$JAVA_HOME" ] && export JAVA_HOME

# Load the VERBOSE setting and other rcS variables
. /lib/init/vars.sh

# Define LSB log_* functions.
# Depend on lsb-base (>= 3.0-6) to ensure that this file is present.
. /lib/lsb/init-functions

#
# Function that returns 0 if process is running, or nonzero if not.
#
# The nonzero value is 3 if the process is simply not running, and 1 if the
# process is not running but the pidfile exists (to match the exit codes for
# the "status" command; see LSB core spec 3.1, section 20.2)
#
CMD_PATT="cassandra.+CassandraDaemon"
is_running()
{
    if [ -f $PIDFILE ]; then
        pid=`cat $PIDFILE`
        grep -Eq "$CMD_PATT" "/proc/$pid/cmdline" 2>/dev/null && return 0
        return 1
    fi
    return 3
}

#
# Function that starts the daemon/service
#
do_start()
{
    # Return
    #   0 if daemon has been started
    #   1 if daemon was already running
    #   2 if daemon could not be started
    ulimit -l unlimited
    ulimit -n "$FD_LIMIT"
    cassandra_home=`getent passwd cassandra | awk -F ':' '{ print $6; }'`
    heap_dump_f="$cassandra_home/java_`date +%s`.hprof"
    error_log_f="$cassandra_home/hs_err_`date +%s`.log"
    [ -e `dirname "$PIDFILE"` ] || \
        install -d -ocassandra -gcassandra -m755 `dirname $PIDFILE`
    start-stop-daemon -S -c cassandra -a /usr/sbin/cassandra -q -p "$PIDFILE" -t >/dev/null || return 1
    start-stop-daemon -S -c cassandra -a /usr/sbin/cassandra -b -p "$PIDFILE" -- \
        -p "$PIDFILE" -H "$heap_dump_f" -E "$error_log_f" >/dev/null || return 2
}

#
# Function that stops the daemon/service
#
do_stop()
{
    # Return
    #   0 if daemon has been stopped
    #   1 if daemon was already stopped
    #   2 if daemon could not be stopped
    #   other if a failure occurred
    start-stop-daemon -K -p "$PIDFILE" -R TERM/30/KILL/5 >/dev/null
    RET=$?
    rm -f "$PIDFILE"
    return $RET
}

case "$1" in
  start)
    [ "$VERBOSE" != no ] && log_daemon_msg "Starting $DESC" "$NAME"
    do_start
    case "$?" in
        0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
        2)   [ "$VERBOSE" != no ] && log_end_msg 1 ;;
    esac
    ;;
  stop)
    [ "$VERBOSE" != no ] && log_daemon_msg "Stopping $DESC" "$NAME"
    do_stop
    case "$?" in
        0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
        2)   [ "$VERBOSE" != no ] && log_end_msg 1 ;;
    esac
    ;;
  restart|force-reload)
    log_daemon_msg "Restarting $DESC" "$NAME"
    do_stop
    case "$?" in
      0|1)
        do_start
        case "$?" in
            0) log_end_msg 0 ;;
            1) log_
Re: Basic query in setting up secure inter-dc cluster
Hi Everyone. Kindly reply in "yes" or "no", as to whether it is possible to setup encryption only between particular pair of nodes? Or is it an "all" or "none" feature, where encryption is present between EVERY PAIR of nodes, or in NO PAIR of nodes. Thanks and Regards, Ajay On Mon, Apr 18, 2016 at 9:55 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Also, wondering what is the difference between "all" and "dc" in > "internode_encryption". > Perhaps my answer lies in this? > > On Mon, Apr 18, 2016 at 9:51 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Ok, trying to wake up this thread again. >> >> I went through the following links :: >> >> >> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html >> >> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html >> >> >> and I am wondering *if it is possible to setup secure >> inter-communication only between some nodes*. >> >> In particular, if I have a 2*2 cluster, is it possible to setup secure >> communication ONLY between the nodes of DC2? >> Once it works well, we would then setup secure-communication everywhere. >> >> We are wanting this, because DC2 is the backup centre, while DC1 is the >> primary-centre connected directly to the application-server. We don't want >> to screw things if something goes bad in DC1. >> >> >> Will be grateful for pointers. >> >> >> Thanks and Regards, >> Ajay >> >> On Sun, Jan 17, 2016 at 9:09 PM, Ajay Garg <ajaygargn...@gmail.com> >> wrote: >> >>> Hi All. >>> >>> A gentle query-reminder. >>> >>> I will be grateful if I could be given a brief technical overview, as to >>> how secure-communication occurs between two nodes in a cluster. >>> >>> Please note that I wish for some information on the "how it works below >>> the hood", and NOT "how to set it up". 
>>> >>> >>> >>> Thanks and Regards, >>> Ajay >>> >>> On Wed, Jan 6, 2016 at 4:16 PM, Ajay Garg <ajaygargn...@gmail.com> >>> wrote: >>> >>>> Thanks everyone for the reply. >>>> >>>> I actually have a fair bit of questions, but it will be nice if someone >>>> could please tell me the flow (implementation-wise), as to how node-to-node >>>> encryption works in a cluster. >>>> >>>> Let's say node1 from DC1, wishes to talk securely to node 2 from DC2 >>>> (with *"require_client_auth: false*"). >>>> I presume it would be like below (please correct me if am wrong) :: >>>> >>>> a) >>>> node1 tries to connect to node2, using the certificate *as defined on >>>> node1* in cassandra.yaml. >>>> >>>> b) >>>> node2 will confirm if the certificate being offered by node1 is in the >>>> truststore *as defined on node2* in cassandra.yaml. >>>> if it is, secure-communication is allowed. >>>> >>>> >>>> Is my thinking right? >>>> I >>>> >>>> On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave <nehajtriv...@gmail.com> >>>> wrote: >>>> >>>>> Hi Ajay, >>>>> Have a look here : >>>>> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html >>>>> >>>>> You can configure for DC level Security: >>>>> >>>>> Procedure >>>>> >>>>> On each node under sever_encryption_options: >>>>> >>>>>- Enable internode_encryption. >>>>>The available options are: >>>>> - all >>>>> - none >>>>> - dc: Cassandra encrypts the traffic between the data centers. >>>>> - rack: Cassandra encrypts the traffic between the racks. >>>>> >>>>> regards >>>>> >>>>> Neha >>>>> >>>>> >>>>> >>>>> On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet < >>>>> absi...@informatica.com> wrote: >>>>> >>>>>> Security is a very wide concept. What exactly do you want to achieve ? 
>>>>>> >>>>>> >>>>>> >>>>>> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com] >>>>>> *Sent:* Wednesday, January 06, 2016 11:27 AM >>>>>> *To:* user@cassandra.apache.org >>>>>> *Subject:* Basic query in setting up secure inter-dc cluster >>>>>> >>>>>> >>>>>> >>>>>> Hi All. >>>>>> >>>>>> We have a 2*2 cluster deployed, but no security as of now. >>>>>> >>>>>> As a first stage, we wish to implement inter-dc security. >>>>>> >>>>>> Is it possible to enable security one machine at a time? >>>>>> >>>>>> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. >>>>>> >>>>>> If I make the changes JUST IN DC2M2 and restart it, will the traffic >>>>>> between DC1M1/DC1M2 and DC2M2 be secure? Or security will kick in ONLY >>>>>> AFTER the changes are made in all the 4 machines? >>>>>> >>>>>> Asking here, because I don't want to screw up a live cluster due to >>>>>> my lack of experience. >>>>>> >>>>>> Looking forward to some pointers. >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> Regards, >>>>>> Ajay >>>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Ajay >>>> >>> >>> >>> >>> -- >>> Regards, >>> Ajay >>> >> >> >> >> -- >> Regards, >> Ajay >> > > > > -- > Regards, > Ajay > -- Regards, Ajay
Re: Basic query in setting up secure inter-dc cluster
Also, wondering what is the difference between "all" and "dc" in "internode_encryption". Perhaps my answer lies in this? On Mon, Apr 18, 2016 at 9:51 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Ok, trying to wake up this thread again. > > I went through the following links :: > > > https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html > > https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html > > > and I am wondering *if it is possible to setup secure inter-communication > only between some nodes*. > > In particular, if I have a 2*2 cluster, is it possible to setup secure > communication ONLY between the nodes of DC2? > Once it works well, we would then setup secure-communication everywhere. > > We are wanting this, because DC2 is the backup centre, while DC1 is the > primary-centre connected directly to the application-server. We don't want > to screw things if something goes bad in DC1. > > > Will be grateful for pointers. > > > Thanks and Regards, > Ajay > > On Sun, Jan 17, 2016 at 9:09 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Hi All. >> >> A gentle query-reminder. >> >> I will be grateful if I could be given a brief technical overview, as to >> how secure-communication occurs between two nodes in a cluster. >> >> Please note that I wish for some information on the "how it works below >> the hood", and NOT "how to set it up". >> >> >> >> Thanks and Regards, >> Ajay >> >> On Wed, Jan 6, 2016 at 4:16 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: >> >>> Thanks everyone for the reply. >>> >>> I actually have a fair bit of questions, but it will be nice if someone >>> could please tell me the flow (implementation-wise), as to how node-to-node >>> encryption works in a cluster. >>> >>> Let's say node1 from DC1, wishes to talk securely to node 2 from DC2 >>> (with *"require_client_auth: false*"). 
>>> I presume it would be like below (please correct me if am wrong) :: >>> >>> a) >>> node1 tries to connect to node2, using the certificate *as defined on >>> node1* in cassandra.yaml. >>> >>> b) >>> node2 will confirm if the certificate being offered by node1 is in the >>> truststore *as defined on node2* in cassandra.yaml. >>> if it is, secure-communication is allowed. >>> >>> >>> Is my thinking right? >>> I >>> >>> On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave <nehajtriv...@gmail.com> >>> wrote: >>> >>>> Hi Ajay, >>>> Have a look here : >>>> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html >>>> >>>> You can configure for DC level Security: >>>> >>>> Procedure >>>> >>>> On each node under sever_encryption_options: >>>> >>>>- Enable internode_encryption. >>>>The available options are: >>>> - all >>>> - none >>>> - dc: Cassandra encrypts the traffic between the data centers. >>>> - rack: Cassandra encrypts the traffic between the racks. >>>> >>>> regards >>>> >>>> Neha >>>> >>>> >>>> >>>> On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet < >>>> absi...@informatica.com> wrote: >>>> >>>>> Security is a very wide concept. What exactly do you want to achieve ? >>>>> >>>>> >>>>> >>>>> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com] >>>>> *Sent:* Wednesday, January 06, 2016 11:27 AM >>>>> *To:* user@cassandra.apache.org >>>>> *Subject:* Basic query in setting up secure inter-dc cluster >>>>> >>>>> >>>>> >>>>> Hi All. >>>>> >>>>> We have a 2*2 cluster deployed, but no security as of now. >>>>> >>>>> As a first stage, we wish to implement inter-dc security. >>>>> >>>>> Is it possible to enable security one machine at a time? >>>>> >>>>> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. >>>>> >>>>> If I make the changes JUST IN DC2M2 and restart it, will the traffic >>>>> between DC1M1/DC1M2 and DC2M2 be secure? Or security will kick in ONLY >>>>> AFTER the changes are made in all the 4 machines? 
>>>>> >>>>> Asking here, because I don't want to screw up a live cluster due to my >>>>> lack of experience. >>>>> >>>>> Looking forward to some pointers. >>>>> >>>>> >>>>> -- >>>>> >>>>> Regards, >>>>> Ajay >>>>> >>>> >>>> >>> >>> >>> -- >>> Regards, >>> Ajay >>> >> >> >> >> -- >> Regards, >> Ajay >> > > > > -- > Regards, > Ajay > -- Regards, Ajay
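The second DataStax link in the discussion above (secureSSLCertificates) boils down to generating a per-node keystore plus a truststore holding the certificates of the nodes to be trusted. A hedged sketch with the JDK's keytool; the aliases, file names, passwords, and dname fields below are placeholders, not values from the thread:

```shell
# Sketch of the keystore/truststore preparation for node-to-node SSL.
# Runs in a scratch directory; skip gracefully where no JDK is installed.
command -v keytool >/dev/null 2>&1 || exit 0
cd "$(mktemp -d)"

# 1. A private key and self-signed certificate for one node:
keytool -genkeypair -keyalg RSA -alias node1 -validity 365 \
        -keystore node1.keystore -storepass cassandra -keypass cassandra \
        -dname "CN=node1, OU=cluster, O=example, C=US"

# 2. Export the node's public certificate:
keytool -exportcert -alias node1 -keystore node1.keystore \
        -storepass cassandra -file node1.cer

# 3. Import each node's certificate into a truststore, which is then
#    distributed to every node that should trust node1:
keytool -importcert -alias node1 -file node1.cer \
        -keystore shared.truststore -storepass cassandra -noprompt
```

Which connections actually use these stores is then controlled separately by the `internode_encryption` setting (`all`, `dc`, or `rack`) in cassandra.yaml.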
Re: Basic query in setting up secure inter-dc cluster
Ok, trying to wake up this thread again. I went through the following links :: https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html and I am wondering *if it is possible to setup secure inter-communication only between some nodes*. In particular, if I have a 2*2 cluster, is it possible to setup secure communication ONLY between the nodes of DC2? Once it works well, we would then setup secure-communication everywhere. We are wanting this, because DC2 is the backup centre, while DC1 is the primary-centre connected directly to the application-server. We don't want to screw things if something goes bad in DC1. Will be grateful for pointers. Thanks and Regards, Ajay On Sun, Jan 17, 2016 at 9:09 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Hi All. > > A gentle query-reminder. > > I will be grateful if I could be given a brief technical overview, as to > how secure-communication occurs between two nodes in a cluster. > > Please note that I wish for some information on the "how it works below > the hood", and NOT "how to set it up". > > > > Thanks and Regards, > Ajay > > On Wed, Jan 6, 2016 at 4:16 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Thanks everyone for the reply. >> >> I actually have a fair bit of questions, but it will be nice if someone >> could please tell me the flow (implementation-wise), as to how node-to-node >> encryption works in a cluster. >> >> Let's say node1 from DC1, wishes to talk securely to node 2 from DC2 >> (with *"require_client_auth: false*"). >> I presume it would be like below (please correct me if am wrong) :: >> >> a) >> node1 tries to connect to node2, using the certificate *as defined on >> node1* in cassandra.yaml. >> >> b) >> node2 will confirm if the certificate being offered by node1 is in the >> truststore *as defined on node2* in cassandra.yaml. >> if it is, secure-communication is allowed. 
>> >> >> Is my thinking right? >> I >> >> On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave <nehajtriv...@gmail.com> wrote: >> >>> Hi Ajay, >>> Have a look here : >>> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html >>> >>> You can configure for DC level Security: >>> >>> Procedure >>> >>> On each node under sever_encryption_options: >>> >>>- Enable internode_encryption. >>>The available options are: >>> - all >>> - none >>> - dc: Cassandra encrypts the traffic between the data centers. >>> - rack: Cassandra encrypts the traffic between the racks. >>> >>> regards >>> >>> Neha >>> >>> >>> >>> On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet < >>> absi...@informatica.com> wrote: >>> >>>> Security is a very wide concept. What exactly do you want to achieve ? >>>> >>>> >>>> >>>> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com] >>>> *Sent:* Wednesday, January 06, 2016 11:27 AM >>>> *To:* user@cassandra.apache.org >>>> *Subject:* Basic query in setting up secure inter-dc cluster >>>> >>>> >>>> >>>> Hi All. >>>> >>>> We have a 2*2 cluster deployed, but no security as of now. >>>> >>>> As a first stage, we wish to implement inter-dc security. >>>> >>>> Is it possible to enable security one machine at a time? >>>> >>>> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. >>>> >>>> If I make the changes JUST IN DC2M2 and restart it, will the traffic >>>> between DC1M1/DC1M2 and DC2M2 be secure? Or security will kick in ONLY >>>> AFTER the changes are made in all the 4 machines? >>>> >>>> Asking here, because I don't want to screw up a live cluster due to my >>>> lack of experience. >>>> >>>> Looking forward to some pointers. >>>> >>>> >>>> -- >>>> >>>> Regards, >>>> Ajay >>>> >>> >>> >> >> >> -- >> Regards, >> Ajay >> > > > > -- > Regards, > Ajay > -- Regards, Ajay
Can we set TTL on individual fields (columns) using the Datastax java-driver
Something like ::

##
class A {

    @Id
    @Column(name = "pojo_key")
    int key;

    @Ttl(10)
    @Column(name = "pojo_temporary_guest")
    String guest;
}
##

When I persist, let's say, the value "ajay" in the guest field (the pojo_temporary_guest column), it stays forever, and does not become "null" after 10 seconds.

Kindly point out what I am doing wrong. I will be grateful.

Thanks and Regards,
Ajay
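Whichever object mapper the @Ttl annotation above comes from, in CQL itself a TTL is a property of each write, not of the schema, which suggests a way to get the desired behaviour at the query level. A sketch; `my_table` is a hypothetical table name standing in for whatever class A maps to:

```cql
-- A TTL given with USING TTL applies only to the columns written by that
-- statement. This UPDATE writes just the pojo_temporary_guest column with
-- a 10-second TTL; the rest of the row is untouched, and the column reads
-- back as null once the TTL expires.
UPDATE my_table USING TTL 10
   SET pojo_temporary_guest = 'ajay'
 WHERE pojo_key = 1;

-- The remaining TTL on a column can be inspected with the ttl() function:
SELECT ttl(pojo_temporary_guest) FROM my_table WHERE pojo_key = 1;
```

If the mapper persists the whole object in one INSERT, a per-field TTL annotation cannot be honoured by a single write, which may be why the value never expires.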
Re: Basic query in setting up secure inter-dc cluster
Hi All. A gentle query-reminder. I will be grateful if I could be given a brief technical overview, as to how secure-communication occurs between two nodes in a cluster. Please note that I wish for some information on the "how it works below the hood", and NOT "how to set it up". Thanks and Regards, Ajay On Wed, Jan 6, 2016 at 4:16 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Thanks everyone for the reply. > > I actually have a fair bit of questions, but it will be nice if someone > could please tell me the flow (implementation-wise), as to how node-to-node > encryption works in a cluster. > > Let's say node1 from DC1, wishes to talk securely to node 2 from DC2 (with > *"require_client_auth: > false*"). > I presume it would be like below (please correct me if am wrong) :: > > a) > node1 tries to connect to node2, using the certificate *as defined on > node1* in cassandra.yaml. > > b) > node2 will confirm if the certificate being offered by node1 is in the > truststore *as defined on node2* in cassandra.yaml. > if it is, secure-communication is allowed. > > > Is my thinking right? > I > > On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave <nehajtriv...@gmail.com> wrote: > >> Hi Ajay, >> Have a look here : >> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html >> >> You can configure for DC level Security: >> >> Procedure >> >> On each node under sever_encryption_options: >> >>- Enable internode_encryption. >>The available options are: >> - all >> - none >> - dc: Cassandra encrypts the traffic between the data centers. >> - rack: Cassandra encrypts the traffic between the racks. >> >> regards >> >> Neha >> >> >> >> On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet <absi...@informatica.com >> > wrote: >> >>> Security is a very wide concept. What exactly do you want to achieve ? 
>>> >>> >>> >>> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com] >>> *Sent:* Wednesday, January 06, 2016 11:27 AM >>> *To:* user@cassandra.apache.org >>> *Subject:* Basic query in setting up secure inter-dc cluster >>> >>> >>> >>> Hi All. >>> >>> We have a 2*2 cluster deployed, but no security as of now. >>> >>> As a first stage, we wish to implement inter-dc security. >>> >>> Is it possible to enable security one machine at a time? >>> >>> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. >>> >>> If I make the changes JUST IN DC2M2 and restart it, will the traffic >>> between DC1M1/DC1M2 and DC2M2 be secure? Or security will kick in ONLY >>> AFTER the changes are made in all the 4 machines? >>> >>> Asking here, because I don't want to screw up a live cluster due to my >>> lack of experience. >>> >>> Looking forward to some pointers. >>> >>> >>> -- >>> >>> Regards, >>> Ajay >>> >> >> > > > -- > Regards, > Ajay > -- Regards, Ajay
Re: Basic query in setting up secure inter-dc cluster
Thanks everyone for the reply. I actually have a fair bit of questions, but it will be nice if someone could please tell me the flow (implementation-wise), as to how node-to-node encryption works in a cluster. Let's say node1 from DC1, wishes to talk securely to node 2 from DC2 (with *"require_client_auth: false*"). I presume it would be like below (please correct me if am wrong) :: a) node1 tries to connect to node2, using the certificate *as defined on node1* in cassandra.yaml. b) node2 will confirm if the certificate being offered by node1 is in the truststore *as defined on node2* in cassandra.yaml. if it is, secure-communication is allowed. Is my thinking right? I On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave <nehajtriv...@gmail.com> wrote: > Hi Ajay, > Have a look here : > https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html > > You can configure for DC level Security: > > Procedure > > On each node under sever_encryption_options: > >- Enable internode_encryption. >The available options are: > - all > - none > - dc: Cassandra encrypts the traffic between the data centers. > - rack: Cassandra encrypts the traffic between the racks. > > regards > > Neha > > > > On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet <absi...@informatica.com> > wrote: > >> Security is a very wide concept. What exactly do you want to achieve ? >> >> >> >> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com] >> *Sent:* Wednesday, January 06, 2016 11:27 AM >> *To:* user@cassandra.apache.org >> *Subject:* Basic query in setting up secure inter-dc cluster >> >> >> >> Hi All. >> >> We have a 2*2 cluster deployed, but no security as of now. >> >> As a first stage, we wish to implement inter-dc security. >> >> Is it possible to enable security one machine at a time? >> >> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. >> >> If I make the changes JUST IN DC2M2 and restart it, will the traffic >> between DC1M1/DC1M2 and DC2M2 be secure? 
Or security will kick in ONLY >> AFTER the changes are made in all the 4 machines? >> >> Asking here, because I don't want to screw up a live cluster due to my >> lack of experience. >> >> Looking forward to some pointers. >> >> >> -- >> >> Regards, >> Ajay >> > > -- Regards, Ajay
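Neha's procedure above maps onto the `server_encryption_options` block of cassandra.yaml (note the spelling; the quoted mail's "sever_encryption_options" is a typo). A sketch of a DC-level configuration; the keystore/truststore paths and passwords are placeholders:

```yaml
# cassandra.yaml fragment (sketch; paths and passwords are placeholders).
# internode_encryption: dc encrypts only traffic crossing data centers;
# "all" encrypts every node-to-node connection, "rack" only cross-rack
# traffic, and "none" disables internode encryption.
server_encryption_options:
    internode_encryption: dc
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: cassandra
    truststore: /etc/cassandra/conf/.truststore
    truststore_password: cassandra
    # With require_client_auth: true, the connecting node's certificate is
    # additionally verified against the local truststore.
    require_client_auth: false
```

This granularity (all/dc/rack/none) also answers the yes-or-no question above: the option selects encryption per traffic class, not per arbitrary pair of nodes.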
Basic query in setting up secure inter-dc cluster
Hi All.

We have a 2*2 cluster deployed, but no security as of now. As a first stage, we wish to implement inter-dc security.

Is it possible to enable security one machine at a time? For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2. If I make the changes JUST IN DC2M2 and restart it, will the traffic between DC1M1/DC1M2 and DC2M2 be secure? Or will security kick in ONLY AFTER the changes are made in all the 4 machines?

Asking here, because I don't want to screw up a live cluster due to my lack of experience. Looking forward to some pointers.

--
Regards,
Ajay
Re: Doubt regarding consistency-level in Cassandra-2.1.10
Hi All. I think we got the root-cause. One of the fields in one of the class was marked with "@Version" annotation, which was causing the Cassandra-Java-Driver to insert "If Not Exists" in the insert query, thus invoking SERIAL consistency-level. We removed the annotation (didn't really need that), and we have not observed the error since about an hour or so. Thanks Eric and Bryan for the help !!! Thanks and Regards, Ajay On Wed, Nov 4, 2015 at 8:51 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Hmm... ok. > > Ideally, we require :: > > a) > The intra-DC-node-syncing takes place at the statement/query level. > > b) > The inter-DC-node-syncing takes place at cassandra level. > > > That way, we don't spend too much delay at the statement/query level. > > > For the so-called CAS/lightweight transactions, the above are impossible > then? > > On Wed, Nov 4, 2015 at 5:58 AM, Bryan Cheng <br...@blockcypher.com> wrote: > >> What Eric means is that SERIAL consistency is a special type of >> consistency that is only invoked for a subset of operations: those that use >> CAS/lightweight transactions, for example "IF NOT EXISTS" queries. >> >> The differences between CAS operations and standard operations are >> significant and there are large repercussions for tunable consistency. The >> amount of time such an operation takes is greatly increased as well; you >> may need to increase your internal node-to-node timeouts . >> >> On Mon, Nov 2, 2015 at 8:01 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: >> >>> Hi Eric, >>> >>> I am sorry, but I don't understand. >>> >>> If there had been some issue in the configuration, then the >>> consistency-issue would be seen everytime (I guess). >>> As of now, the error is seen sometimes (probably 30% of times). >>> >>> On Mon, Nov 2, 2015 at 10:24 PM, Eric Stevens <migh...@gmail.com> wrote: >>> >>>> Serial consistency gets invoked at the protocol level when doing >>>> lightweight transactions such as CAS operations. 
If you're expecting that >>>> your topology is RF=2, N=2, it seems like some keyspace has RF=3, and so >>>> there aren't enough nodes available to satisfy serial consistency. >>>> >>>> See >>>> http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html >>>> >>>> On Mon, Nov 2, 2015 at 1:29 AM Ajay Garg <ajaygargn...@gmail.com> >>>> wrote: >>>> >>>>> Hi All. >>>>> >>>>> I have a 2*2 Network-Topology Replication setup, and I run my >>>>> application via DataStax-driver. >>>>> >>>>> I frequently get the errors of type :: >>>>> *Cassandra timeout during write query at consistency SERIAL (3 replica >>>>> were required but only 0 acknowledged the write)* >>>>> >>>>> I have already tried passing a "write-options with LOCAL_QUORUM >>>>> consistency-level" in all create/save statements, but I still get this >>>>> error. >>>>> >>>>> Does something else need to be changed in >>>>> /etc/cassandra/cassandra.yaml too? >>>>> Or may be some another place? >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> Ajay >>>>> >>>> >>> >>> >>> -- >>> Regards, >>> Ajay >>> >> >> > > > -- > Regards, > Ajay > -- Regards, Ajay
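A hedged aside on the quorum arithmetic behind Eric's diagnosis: a quorum is floor(RF/2) + 1 replicas, and SERIAL (as opposed to LOCAL_SERIAL) counts replicas across all data centers. The class below is a minimal sketch of that arithmetic only; the `quorum` helper is illustrative, not a driver API, and the RF readings in the comments are possible interpretations of the error message, not a confirmed diagnosis.

```java
// Minimal sketch of quorum arithmetic for tunable consistency.
// quorum(rf) = rf/2 + 1 replicas must acknowledge a QUORUM/SERIAL operation.
public class QuorumMath {
    static int quorum(int totalReplicas) {
        return totalReplicas / 2 + 1;
    }

    public static void main(String[] args) {
        // A keyspace with RF=3 needs 2 acknowledgements at QUORUM.
        System.out.println(quorum(3)); // 2
        // RF=2 per DC across two DCs is 4 replicas total; a cluster-wide
        // quorum is then 3, one reading consistent with the
        // "3 replica were required" text in the error message.
        System.out.println(quorum(4)); // 3
    }
}
```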
Re: Doubt regarding consistency-level in Cassandra-2.1.10
Hmm... ok. Ideally, we require :: a) The intra-DC-node-syncing takes place at the statement/query level. b) The inter-DC-node-syncing takes place at cassandra level. That way, we don't spend too much delay at the statement/query level. For the so-called CAS/lightweight transactions, the above are impossible then? On Wed, Nov 4, 2015 at 5:58 AM, Bryan Cheng <br...@blockcypher.com> wrote: > What Eric means is that SERIAL consistency is a special type of > consistency that is only invoked for a subset of operations: those that use > CAS/lightweight transactions, for example "IF NOT EXISTS" queries. > > The differences between CAS operations and standard operations are > significant and there are large repercussions for tunable consistency. The > amount of time such an operation takes is greatly increased as well; you > may need to increase your internal node-to-node timeouts . > > On Mon, Nov 2, 2015 at 8:01 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Hi Eric, >> >> I am sorry, but I don't understand. >> >> If there had been some issue in the configuration, then the >> consistency-issue would be seen everytime (I guess). >> As of now, the error is seen sometimes (probably 30% of times). >> >> On Mon, Nov 2, 2015 at 10:24 PM, Eric Stevens <migh...@gmail.com> wrote: >> >>> Serial consistency gets invoked at the protocol level when doing >>> lightweight transactions such as CAS operations. If you're expecting that >>> your topology is RF=2, N=2, it seems like some keyspace has RF=3, and so >>> there aren't enough nodes available to satisfy serial consistency. >>> >>> See >>> http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html >>> >>> On Mon, Nov 2, 2015 at 1:29 AM Ajay Garg <ajaygargn...@gmail.com> wrote: >>> >>>> Hi All. >>>> >>>> I have a 2*2 Network-Topology Replication setup, and I run my >>>> application via DataStax-driver. 
>>>> >>>> I frequently get the errors of type :: >>>> *Cassandra timeout during write query at consistency SERIAL (3 replica >>>> were required but only 0 acknowledged the write)* >>>> >>>> I have already tried passing a "write-options with LOCAL_QUORUM >>>> consistency-level" in all create/save statements, but I still get this >>>> error. >>>> >>>> Does something else need to be changed in /etc/cassandra/cassandra.yaml >>>> too? >>>> Or may be some another place? >>>> >>>> >>>> -- >>>> Regards, >>>> Ajay >>>> >>> >> >> >> -- >> Regards, >> Ajay >> > > -- Regards, Ajay
Doubt regarding consistency-level in Cassandra-2.1.10
Hi All. I have a 2*2 Network-Topology Replication setup, and I run my application via the DataStax-driver. I frequently get errors of the type :: *Cassandra timeout during write query at consistency SERIAL (3 replica were required but only 0 acknowledged the write)* I have already tried passing a "write-options with LOCAL_QUORUM consistency-level" in all create/save statements, but I still get this error. Does something else need to be changed in /etc/cassandra/cassandra.yaml too? Or maybe somewhere else? -- Regards, Ajay
Re: Doubt regarding consistency-level in Cassandra-2.1.10
Hi Eric, I am sorry, but I don't understand. If there had been some issue in the configuration, then the consistency-issue would be seen everytime (I guess). As of now, the error is seen sometimes (probably 30% of times). On Mon, Nov 2, 2015 at 10:24 PM, Eric Stevens <migh...@gmail.com> wrote: > Serial consistency gets invoked at the protocol level when doing > lightweight transactions such as CAS operations. If you're expecting that > your topology is RF=2, N=2, it seems like some keyspace has RF=3, and so > there aren't enough nodes available to satisfy serial consistency. > > See > http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html > > On Mon, Nov 2, 2015 at 1:29 AM Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Hi All. >> >> I have a 2*2 Network-Topology Replication setup, and I run my application >> via DataStax-driver. >> >> I frequently get the errors of type :: >> *Cassandra timeout during write query at consistency SERIAL (3 replica >> were required but only 0 acknowledged the write)* >> >> I have already tried passing a "write-options with LOCAL_QUORUM >> consistency-level" in all create/save statements, but I still get this >> error. >> >> Does something else need to be changed in /etc/cassandra/cassandra.yaml >> too? >> Or may be some another place? >> >> >> -- >> Regards, >> Ajay >> > -- Regards, Ajay
Can consistency-levels be different for "read" and "write" in Datastax Java-Driver?
Right now, I have set up "LOCAL_QUORUM" as the consistency level in the driver, but it seems that "SERIAL" is being used during writes, and I consistently get this error of type :: *Cassandra timeout during write query at consistency SERIAL (3 replica were required but only 0 acknowledged the write)* Am I missing something? -- Regards, Ajay
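For reference, the Java driver does allow read and write consistency to differ: consistency is a property of each Statement, not of the connection. The fragment below is an illustrative sketch only, assuming DataStax Java driver 2.x APIs (`setConsistencyLevel`, `setSerialConsistencyLevel`); the keyspace/table names are made up, and it is not runnable without the driver jar and a live session.

```java
// Illustrative only: per-statement consistency, DataStax Java driver 2.x.
// Keyspace/table names (my_ks.users) are hypothetical.
Statement write = new SimpleStatement(
        "INSERT INTO my_ks.users (id, name) VALUES (?, ?)", 1, "ajay")
    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

// Conditional writes (IF NOT EXISTS) additionally use the *serial*
// consistency level for their Paxos phase; it can be set explicitly.
Statement casWrite = new SimpleStatement(
        "INSERT INTO my_ks.users (id, name) VALUES (?, ?) IF NOT EXISTS", 2, "x")
    .setSerialConsistencyLevel(ConsistencyLevel.LOCAL_SERIAL);

Statement read = new SimpleStatement(
        "SELECT * FROM my_ks.users WHERE id = ?", 1)
    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
```

This is also why an `@Version`-style annotation that silently adds IF NOT EXISTS can surface SERIAL timeouts even when every explicit statement requests LOCAL_QUORUM.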
Re: Is replication possible with already existing data?
Some more observations :: a) CAS11 and CAS12 are down, CAS21 and CAS22 up. If I connect via the driver to the cluster using only CAS21 and CAS22 as contact-points, even then the exception occurs. b) CAS11 down, CAS12 up, CAS21 and CAS22 up. If I connect via the driver to the cluster using only CAS21 and CAS22 as contact-points, then connection goes fine. c) CAS11 up, CAS12 down, CAS21 and CAS22 up. If I connect via the driver to the cluster using only CAS21 and CAS22 as contact-points, then connection goes fine. Seems the java-driver is kinda always requiring either one of CAS11 or CAS12 to be up (although the expectation is that the driver must work fine if ANY of the 4 nodes is up). Thoughts, experts !? :) On Sat, Oct 24, 2015 at 9:40 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Ideas please, on what I may be doing wrong? > > On Sat, Oct 24, 2015 at 5:48 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Hi All. >> >> I have been doing extensive testing, and replication works fine, even if >> any permuatation of CAS11, CAS12, CAS21, CAS22 are downed and brought up. >> Syncing always takes place (obviously, as long as continuous-downtime-value >> does not exceed *max_hint_window_in_ms*). >> >> >> However, things behave weird when I try connecting via DataStax >> Java-Driver. >> I always add the nodes to the cluster in the order :: >> >> CAS11, CAS12, CAS21, CAS22 >> >> during "cluster.connect" method. >> >> >> Now, following happens :: >> >> a) >> If CAS11 goes down, data is persisted fine (presumably first in CAS12, >> and later replicated to CAS21 and CAS22). >> >> b) >> If CAS11 and CAS12 go down, data is NOT persisted. 
>> Instead the following exceptions are observed in the Java-Driver :: >> >> >> ## >> Exception in thread "main" >> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) >> tried for query failed (no host was tried) >> at >> com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65) >> at >> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:258) >> at com.datastax.driver.core.Cluster.connect(Cluster.java:267) >> at com.example.cassandra.SimpleClient.connect(SimpleClient.java:43) >> at >> com.example.cassandra.SimpleClientTest.setUp(SimpleClientTest.java:50) >> at >> com.example.cassandra.SimpleClientTest.main(SimpleClientTest.java:86) >> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: >> All host(s) tried for query failed (no host was tried) >> at >> com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103) >> at >> com.datastax.driver.core.SessionManager.execute(SessionManager.java:446) >> at >> com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:482) >> at >> com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:88) >> at >> com.datastax.driver.core.AbstractSession.executeAsync(AbstractSession.java:60) >> at com.datastax.driver.core.Cluster.connect(Cluster.java:260) >> ... 3 more >> >> ### >> >> >> I have already tried :: >> >> 1) >> Increasing driver-read-timeout from 12 seconds to 30 seconds. >> >> 2) >> Increasing driver-connect-timeout from 5 seconds to 30 seconds. >> >> 3) >> I have also confirmed that each of the 4 nodes are telnet-able over ports >> 9042 and 9160 each. >> >> >> Definitely seems to be some driver-issue, since >> data-persistence/replication works perfect (with any permutation) if >> data-persistence is done via "cqlsh". >> >> >> Kindly provide some pointers. 
>> Ultimately, it is the Java-driver that will be used in production, so it >> is imperative that data-persistence/replication happens for any downing of >> any permutation of node(s). >> >> >> Thanks and Regards, >> Ajay >> > > > > -- > Regards, > Ajay > -- Regards, Ajay
Re: Is replication possible with already existing data?
Bingo !!! Using "LoadBalancingPolicy" did the trick. Exactly what was needed !!! Thanks and Regards, Ajay On Sun, Oct 25, 2015 at 5:52 PM, Ryan Svihla <r...@foundev.pro> wrote: > Ajay, > > So It's the default driver behavior to pin requests to the first data > center it connects to (DCAwareRoundRobin strategy). but let me explain why > this is. > > I think you're thinking about data centers in Cassandra as a unit of > failure, and while you can have say a rack fail, as you scale up and use > rack awareness, it's rare you lose a whole "data center" in the sense > you're thinking about, so lets reset a bit: > >1. If I'm designing a multidc architecture, usually the nature of >latency I will not want my app servers connecting _across_ data centers. >2. So since the common desire is not to magically have very high >latency requests bleed out to remote data centers, the default behavior of >the driver is to pin to the first data center it connects too, you can >change this with a different Load Balancing Policy ( > > http://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/policies/LoadBalancingPolicy.html >) >3. However, I generally do NOT advise users connecting to an app >server from another data center, since Cassandra is a masterless >architecture you typically have issues that affect nodes, and not an entire >data center and if they affect an entire data center (say the intra DC link >is down) then it's going to affect your app server as well! > > So for new users, I typically just recommend pinning an app server to a DC > and do your data center level switching further up. You can get more > advanced and handle bleed out later, but you have to think of latencies. > > Final point, rely on repairs for your data consistency, hints are great > and all but repair is how you make sure you're in sync. 
> > On Sun, Oct 25, 2015 at 3:10 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Some more observations :: >> >> a) >> CAS11 and CAS12 are down, CAS21 and CAS22 up. >> If I connect via the driver to the cluster using only CAS21 and CAS22 as >> contact-points, even then the exception occurs. >> >> b) >> CAS11 down, CAS12 up, CAS21 and CAS22 up. >> If I connect via the driver to the cluster using only CAS21 and CAS22 as >> contact-points, then connection goes fine. >> >> c) >> CAS11 up, CAS12 down, CAS21 and CAS22 up. >> If I connect via the driver to the cluster using only CAS21 and CAS22 as >> contact-points, then connection goes fine. >> >> >> Seems the java-driver is kinda always requiring either one of CAS11 or >> CAS12 to be up (although the expectation is that the driver must work fine >> if ANY of the 4 nodes is up). >> >> >> Thoughts, experts !? :) >> >> >> >> On Sat, Oct 24, 2015 at 9:40 PM, Ajay Garg <ajaygargn...@gmail.com> >> wrote: >> >>> Ideas please, on what I may be doing wrong? >>> >>> On Sat, Oct 24, 2015 at 5:48 PM, Ajay Garg <ajaygargn...@gmail.com> >>> wrote: >>> >>>> Hi All. >>>> >>>> I have been doing extensive testing, and replication works fine, even >>>> if any permuatation of CAS11, CAS12, CAS21, CAS22 are downed and brought >>>> up. Syncing always takes place (obviously, as long as >>>> continuous-downtime-value does not exceed *max_hint_window_in_ms*). >>>> >>>> >>>> However, things behave weird when I try connecting via DataStax >>>> Java-Driver. >>>> I always add the nodes to the cluster in the order :: >>>> >>>> CAS11, CAS12, CAS21, CAS22 >>>> >>>> during "cluster.connect" method. >>>> >>>> >>>> Now, following happens :: >>>> >>>> a) >>>> If CAS11 goes down, data is persisted fine (presumably first in CAS12, >>>> and later replicated to CAS21 and CAS22). >>>> >>>> b) >>>> If CAS11 and CAS12 go down, data is NOT persisted. 
>>>> Instead the following exceptions are observed in the Java-Driver :: >>>> >>>> >>>> ## >>>> Exception in thread "main" >>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) >>>> tried for query failed (no host was tried) >>>> at >&g
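Ryan's suggestion, roughly in code: the sketch below assumes DataStax Java driver 2.0/2.1, where `DCAwareRoundRobinPolicy(String localDc, int usedHostsPerRemoteDc)` pins requests to a named local data center while allowing a bounded fallback to remote hosts. The DC name and hostnames are placeholders; check the policy's javadoc before relying on the exact constructor.

```java
// Illustrative only (DataStax Java driver 2.0/2.1); names are placeholders.
// Pin the driver to DC1, but allow up to 2 hosts per remote DC as fallback.
Cluster cluster = Cluster.builder()
    .addContactPoints("cas11", "cas12", "cas21", "cas22")
    .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1", 2))
    .build();
```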
Re: Is replication possible with already existing data?
Ideas please, on what I may be doing wrong? On Sat, Oct 24, 2015 at 5:48 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Hi All. > > I have been doing extensive testing, and replication works fine, even if > any permuatation of CAS11, CAS12, CAS21, CAS22 are downed and brought up. > Syncing always takes place (obviously, as long as continuous-downtime-value > does not exceed *max_hint_window_in_ms*). > > > However, things behave weird when I try connecting via DataStax > Java-Driver. > I always add the nodes to the cluster in the order :: > > CAS11, CAS12, CAS21, CAS22 > > during "cluster.connect" method. > > > Now, following happens :: > > a) > If CAS11 goes down, data is persisted fine (presumably first in CAS12, and > later replicated to CAS21 and CAS22). > > b) > If CAS11 and CAS12 go down, data is NOT persisted. > Instead the following exceptions are observed in the Java-Driver :: > > > ## > Exception in thread "main" > com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) > tried for query failed (no host was tried) > at > com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65) > at > com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:258) > at com.datastax.driver.core.Cluster.connect(Cluster.java:267) > at com.example.cassandra.SimpleClient.connect(SimpleClient.java:43) > at > com.example.cassandra.SimpleClientTest.setUp(SimpleClientTest.java:50) > at > com.example.cassandra.SimpleClientTest.main(SimpleClientTest.java:86) > Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: > All host(s) tried for query failed (no host was tried) > at > com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103) > at > com.datastax.driver.core.SessionManager.execute(SessionManager.java:446) > at > com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:482) > at > 
com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:88) > at > com.datastax.driver.core.AbstractSession.executeAsync(AbstractSession.java:60) > at com.datastax.driver.core.Cluster.connect(Cluster.java:260) > ... 3 more > > ### > > > I have already tried :: > > 1) > Increasing driver-read-timeout from 12 seconds to 30 seconds. > > 2) > Increasing driver-connect-timeout from 5 seconds to 30 seconds. > > 3) > I have also confirmed that each of the 4 nodes are telnet-able over ports > 9042 and 9160 each. > > > Definitely seems to be some driver-issue, since > data-persistence/replication works perfect (with any permutation) if > data-persistence is done via "cqlsh". > > > Kindly provide some pointers. > Ultimately, it is the Java-driver that will be used in production, so it > is imperative that data-persistence/replication happens for any downing of > any permutation of node(s). > > > Thanks and Regards, > Ajay > -- Regards, Ajay
Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?
Never mind Vasileios, you have been a great help !! Thanks a ton again !!! Thanks and Regards, Ajay On Sat, Oct 24, 2015 at 10:17 PM, Vasileios Vlachos < vasileiosvlac...@gmail.com> wrote: > I am not sure I fully understand the question, because nodetool repair is > one of the three ways for Cassandra to ensure consistency. If by "affect" > you mean "make your data consistent and ensure all replicas are > up-to-date", then yes, that's what I think it does. > > And yes, I would expect nodetool repair (especially depending on the > options appended to it) to have a performance impact, but how big that > impact is going to be depends on many things. > > We currently perform no scheduled repairs because of our workload and the > consistency level that we use. So, as you can understand I am certainly not > the best person to analyse that bit... > > Regards, > Vasilis > > On Sat, Oct 24, 2015 at 5:09 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Thanks a ton Vasileios !! >> >> Just one last question :: >> Does running "nodetool repair" affect the functionality of cluster for >> current-live data? >> >> It's ok if the insertions/deletions of current-live data become a little >> slow during the process, but data-consistency must be maintained. If that >> is the case, I think we are good. >> >> >> Thanks and Regards, >> Ajay >> >> On Sat, Oct 24, 2015 at 6:03 PM, Vasileios Vlachos < >> vasileiosvlac...@gmail.com> wrote: >> >>> Hello Ajay, >>> >>> Here is a good link: >>> >>> http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesManualRepair.html >>> >>> Generally, I find the DataStax docs to be OK. You could consult them for >>> all usual operations etc. Ofc there are occasions where a given concept is >>> not as clear, but you can always ask this list for clarification. >>> >>> If you find that something is wrong in the docs just email them (more >>> info and contact email here: http://docs.datastax.com/en/ ). 
>>> >>> Regards, >>> Vasilis >>> >>> On Sat, Oct 24, 2015 at 1:04 PM, Ajay Garg <ajaygargn...@gmail.com> >>> wrote: >>> >>>> Thanks Vasileios for the reply !!! >>>> That makes sense !!! >>>> >>>> I will be grateful if you could point me to the node-repair command for >>>> Cassandra-2.1.10. >>>> I don't want to get stuck in a wrong-versioned documentation (already >>>> bitten once hard when setting up replication). >>>> >>>> Thanks again... >>>> >>>> >>>> Thanks and Regards, >>>> Ajay >>>> >>>> On Sat, Oct 24, 2015 at 4:14 PM, Vasileios Vlachos < >>>> vasileiosvlac...@gmail.com> wrote: >>>> >>>>> Hello Ajay, >>>>> >>>>> Have a look in the *max_hint_window_in_ms* : >>>>> >>>>> >>>>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html >>>>> >>>>> My understanding is that if a node remains down for more than >>>>> *max_hint_window_in_ms*, then you will need to repair that node. >>>>> >>>>> Thanks, >>>>> Vasilis >>>>> >>>>> On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg <ajaygargn...@gmail.com> >>>>> wrote: >>>>> >>>>>> If a node in the cluster goes down and comes up, the data gets synced >>>>>> up on this downed node. >>>>>> Is there a limit on the interval for which the node can remain down? >>>>>> Or the data will be synced up even if the node remains down for >>>>>> weeks/months/years? >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Regards, >>>>>> Ajay >>>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Ajay >>>> >>> >>> >> >> >> -- >> Regards, >> Ajay >> > > -- Regards, Ajay
Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?
Thanks a ton Vasileios !! Just one last question :: Does running "nodetool repair" affect the functionality of cluster for current-live data? It's ok if the insertions/deletions of current-live data become a little slow during the process, but data-consistency must be maintained. If that is the case, I think we are good. Thanks and Regards, Ajay On Sat, Oct 24, 2015 at 6:03 PM, Vasileios Vlachos < vasileiosvlac...@gmail.com> wrote: > Hello Ajay, > > Here is a good link: > > http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesManualRepair.html > > Generally, I find the DataStax docs to be OK. You could consult them for > all usual operations etc. Ofc there are occasions where a given concept is > not as clear, but you can always ask this list for clarification. > > If you find that something is wrong in the docs just email them (more info > and contact email here: http://docs.datastax.com/en/ ). > > Regards, > Vasilis > > On Sat, Oct 24, 2015 at 1:04 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Thanks Vasileios for the reply !!! >> That makes sense !!! >> >> I will be grateful if you could point me to the node-repair command for >> Cassandra-2.1.10. >> I don't want to get stuck in a wrong-versioned documentation (already >> bitten once hard when setting up replication). >> >> Thanks again... >> >> >> Thanks and Regards, >> Ajay >> >> On Sat, Oct 24, 2015 at 4:14 PM, Vasileios Vlachos < >> vasileiosvlac...@gmail.com> wrote: >> >>> Hello Ajay, >>> >>> Have a look in the *max_hint_window_in_ms* : >>> >>> >>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html >>> >>> My understanding is that if a node remains down for more than >>> *max_hint_window_in_ms*, then you will need to repair that node. 
>>> >>> Thanks, >>> Vasilis >>> >>> On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg <ajaygargn...@gmail.com> >>> wrote: >>> >>>> If a node in the cluster goes down and comes up, the data gets synced >>>> up on this downed node. >>>> Is there a limit on the interval for which the node can remain down? Or >>>> the data will be synced up even if the node remains down for >>>> weeks/months/years? >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Ajay >>>> >>> >>> >> >> >> -- >> Regards, >> Ajay >> > > -- Regards, Ajay
Some questions about setting public/private IP-Addresses in Cassandra Cluster
Hi All. We have a scenario where the Application-Server (APP), Node-1 (CAS11), and Node-2 (CAS12) are hosted in DC1. Node-3 (CAS21) and Node-4 (CAS22) are in DC2. The intention is that we provide 4-way redundancy to APP, by specifying CAS11, CAS12, CAS21 and CAS22 as the addresses via the Java-Cassandra-connector. That means, as long as at least one of the 4 nodes is up, the APP should work. We are using Network-Topology, with Murmur3Partitioning. Each Cassandra-Node has two IPs :: one public, and one private-within-the-same-data-center. Following is our IP-Address configuration :: a) Everywhere in "cassandra-topology.properties", we have specified the Public-IP-Addresses of all 4 nodes. b) In each "listen_address" in /etc/cassandra/cassandra.yaml, we have specified the corresponding Public-IP-Address of the node. c) For CAS11 and CAS12, we have specified the corresponding private-IP-Address for "rpc_address" in /etc/cassandra/cassandra.yaml (since APP is hosted in the same data-center). For CAS21 and CAS22, we have specified the corresponding public-IP-Address for "rpc_address" in /etc/cassandra/cassandra.yaml (since APP can only communicate over public IP-Addresses with these nodes). Are any further optimizations possible, in the sense that private-IP-Addresses could be used in more places? I ask this because we need to minimize network-latency, so using private IP-addresses wherever possible will help in this regard. Thanks and Regards, Ajay
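The setup described above, as a cassandra.yaml sketch (all IPs are placeholders from the documentation ranges). One further pointer worth verifying in the 2.1 docs: cassandra.yaml also has a broadcast_address option, which in some deployments lets listen_address stay private while the public address is advertised to other nodes; whether it applies depends on routing/NAT between the DCs, so treat the commented lines as an assumption to check, not a recommendation.

```yaml
# Sketch of the addressing described above (IPs are placeholders).
# CAS11 (DC1, same DC as the app server):
listen_address: 203.0.113.11     # public, inter-node traffic
rpc_address: 10.0.0.11           # private, used by the app server in DC1

# CAS21 (DC2):
# listen_address: 198.51.100.21  # public
# rpc_address: 198.51.100.21     # public; the app reaches DC2 over public IPs

# Possible further optimization (verify against the 2.1 docs first):
# listen_address: 10.0.0.11          # private
# broadcast_address: 203.0.113.11    # public address advertised to other nodes
```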
Downtime-Limit for a node in Network-Topology-Replication-Cluster?
If a node in the cluster goes down and comes up, the data gets synced up on this downed node. Is there a limit on the interval for which the node can remain down? Or the data will be synced up even if the node remains down for weeks/months/years? -- Regards, Ajay
Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?
Thanks Vasileios for the reply !!! That makes sense !!! I will be grateful if you could point me to the node-repair command for Cassandra-2.1.10. I don't want to get stuck in a wrong-versioned documentation (already bitten once hard when setting up replication). Thanks again... Thanks and Regards, Ajay On Sat, Oct 24, 2015 at 4:14 PM, Vasileios Vlachos < vasileiosvlac...@gmail.com> wrote: > Hello Ajay, > > Have a look in the *max_hint_window_in_ms* : > > > http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html > > My understanding is that if a node remains down for more than > *max_hint_window_in_ms*, then you will need to repair that node. > > Thanks, > Vasilis > > On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> If a node in the cluster goes down and comes up, the data gets synced up >> on this downed node. >> Is there a limit on the interval for which the node can remain down? Or >> the data will be synced up even if the node remains down for >> weeks/months/years? >> >> >> >> -- >> Regards, >> Ajay >> > > -- Regards, Ajay
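The setting Vasileios points to, as it appears in cassandra.yaml (the value shown is the stock 2.x default, three hours):

```yaml
# cassandra.yaml: hints for a down node are retained for at most this long.
# A node that stays down longer than this window must be repaired
# (nodetool repair) after it comes back, or it will miss writes.
max_hint_window_in_ms: 10800000   # 3 hours (default)
```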
Re: Is replication possible with already existing data?
Hi All. I have been doing extensive testing, and replication works fine, even if any permuatation of CAS11, CAS12, CAS21, CAS22 are downed and brought up. Syncing always takes place (obviously, as long as continuous-downtime-value does not exceed *max_hint_window_in_ms*). However, things behave weird when I try connecting via DataStax Java-Driver. I always add the nodes to the cluster in the order :: CAS11, CAS12, CAS21, CAS22 during "cluster.connect" method. Now, following happens :: a) If CAS11 goes down, data is persisted fine (presumably first in CAS12, and later replicated to CAS21 and CAS22). b) If CAS11 and CAS12 go down, data is NOT persisted. Instead the following exceptions are observed in the Java-Driver :: ## Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:258) at com.datastax.driver.core.Cluster.connect(Cluster.java:267) at com.example.cassandra.SimpleClient.connect(SimpleClient.java:43) at com.example.cassandra.SimpleClientTest.setUp(SimpleClientTest.java:50) at com.example.cassandra.SimpleClientTest.main(SimpleClientTest.java:86) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103) at com.datastax.driver.core.SessionManager.execute(SessionManager.java:446) at com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:482) at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:88) at com.datastax.driver.core.AbstractSession.executeAsync(AbstractSession.java:60) at com.datastax.driver.core.Cluster.connect(Cluster.java:260) ... 
3 more ### I have already tried :: 1) Increasing driver-read-timeout from 12 seconds to 30 seconds. 2) Increasing driver-connect-timeout from 5 seconds to 30 seconds. 3) I have also confirmed that each of the 4 nodes are telnet-able over ports 9042 and 9160 each. Definitely seems to be some driver-issue, since data-persistence/replication works perfect (with any permutation) if data-persistence is done via "cqlsh". Kindly provide some pointers. Ultimately, it is the Java-driver that will be used in production, so it is imperative that data-persistence/replication happens for any downing of any permutation of node(s). Thanks and Regards, Ajay
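For completeness, the two driver timeouts mentioned above are set via SocketOptions in the DataStax Java driver 2.x; the defaults in the comments match the values in the email (5 s connect, 12 s read). This is an illustrative fragment with placeholder hostnames, not runnable on its own. Note also that a NoHostAvailableException saying "no host was tried" usually means the load-balancing policy produced an empty query plan (here, the default DC-aware policy staying pinned to the first DC), not that any timeout fired, so raising timeouts alone is unlikely to help.

```java
// Illustrative only (DataStax Java driver 2.x); hostnames are placeholders.
Cluster cluster = Cluster.builder()
    .addContactPoints("cas11", "cas12", "cas21", "cas22")
    .withSocketOptions(new SocketOptions()
        .setConnectTimeoutMillis(30000)   // driver default: 5000 ms
        .setReadTimeoutMillis(30000))     // driver default: 12000 ms
    .build();
```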
Re: Is replication possible with already existing data?
Any ideas, please? To repeat, we are using the exact same cassandra-version on all 4 nodes (2.1.10). On Fri, Oct 23, 2015 at 9:43 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Hi Michael. > > Please find below the contents of cassandra.yaml for CAS11 (the files on > the rest of the three nodes are also exactly the same, except the > "initial_token" and "listen_address" fields) :: > > CAS11 :: > > > cluster_name: 'InstaMsg Cluster' > num_tokens: 256 > initial_token: -9223372036854775808 > hinted_handoff_enabled: true > max_hint_window_in_ms: 1080 # 3 hours > hinted_handoff_throttle_in_kb: 1024 > max_hints_delivery_threads: 2 > batchlog_replay_throttle_in_kb: 1024 > authenticator: AllowAllAuthenticator > authorizer: AllowAllAuthorizer > permissions_validity_in_ms: 2000 > partitioner: org.apache.cassandra.dht.Murmur3Partitioner > data_file_directories: > - /var/lib/cassandra/data > > commitlog_directory: /var/lib/cassandra/commitlog > > disk_failure_policy: stop > commit_failure_policy: stop > key_cache_size_in_mb: > key_cache_save_period: 14400 > row_cache_size_in_mb: 0 > row_cache_save_period: 0 > counter_cache_size_in_mb: > counter_cache_save_period: 7200 > saved_caches_directory: /var/lib/cassandra/saved_caches > commitlog_sync: periodic > commitlog_sync_period_in_ms: 1 > commitlog_segment_size_in_mb: 32 > seed_provider: > - class_name: org.apache.cassandra.locator.SimpleSeedProvider > parameters: > - seeds: "104.239.200.33,119.9.92.77" > > concurrent_reads: 32 > concurrent_writes: 32 > concurrent_counter_writes: 32 > > memtable_allocation_type: heap_buffers > > index_summary_capacity_in_mb: > index_summary_resize_interval_in_minutes: 60 > trickle_fsync: false > trickle_fsync_interval_in_kb: 10240 > storage_port: 7000 > ssl_storage_port: 7001 > listen_address: 104.239.200.33 > start_native_transport: true > native_transport_port: 9042 > start_rpc: true > rpc_address: localhost > rpc_port: 9160 > rpc_keepalive: true > > rpc_server_type: sync > 
thrift_framed_transport_size_in_mb: 15 > incremental_backups: false > snapshot_before_compaction: false > auto_snapshot: true > > tombstone_warn_threshold: 1000 > tombstone_failure_threshold: 10 > > column_index_size_in_kb: 64 > batch_size_warn_threshold_in_kb: 5 > > compaction_throughput_mb_per_sec: 16 > compaction_large_partition_warning_threshold_mb: 100 > > sstable_preemptive_open_interval_in_mb: 50 > > read_request_timeout_in_ms: 5000 > range_request_timeout_in_ms: 1 > > write_request_timeout_in_ms: 2000 > counter_write_request_timeout_in_ms: 5000 > cas_contention_timeout_in_ms: 1000 > truncate_request_timeout_in_ms: 6 > request_timeout_in_ms: 1 > cross_node_timeout: false > endpoint_snitch: PropertyFileSnitch > > dynamic_snitch_update_interval_in_ms: 100 > dynamic_snitch_reset_interval_in_ms: 60 > dynamic_snitch_badness_threshold: 0.1 > > request_scheduler: org.apache.cassandra.scheduler.NoScheduler > > server_encryption_options: > internode_encryption: none > keystore: conf/.keystore > keystore_password: cassandra > truststore: conf/.truststore > truststore_password: cassandra > > client_encryption_options: > enabled: false > keystore: conf/.keystore > keystore_password: cassandra > > internode_compression: all > inter_dc_tcp_nodelay: false > > > > What changes need to be made, so that whenever a downed server comes back > up, the missing data comes back over to it? > > Thanks and Regards, > Ajay > > > > On Fri, Oct 23, 2015 at 9:05 AM, Michael Shuler <mich...@pbandjelly.org> > wrote: > >> On 10/22/2015 10:14 PM, Ajay Garg wrote: >> >>> However, CAS11 refuses to come up now. 
>>> Following is the error in /var/log/cassandra/system.log :: >>> >>> >>> >>> ERROR [main] 2015-10-23 03:07:34,242 CassandraDaemon.java:391 - Fatal >>> configuration error >>> org.apache.cassandra.exceptions.ConfigurationException: Cannot change >>> the number of tokens from 1 to 256 >>> >> >> Check your cassandra.yaml - this node has vnodes enabled in the >> configuration when it did not, previously. Check all nodes. Something >> changed. Mixed vnode/non-vnode clusters is bad juju. >> >> -- >> Kind regards, >> Michael >> > > > > -- > Regards, > Ajay > -- Regards, Ajay
Re: Is replication possible with already existing data?
Thanks Steve and Michael. Simply uncommenting "initial_token" did the trick !!! Right now, I was evaluating replication, for the case when everything is a clean install. Will now try my hands on integrating/starting replication, with pre-existing data. Once again, thanks a ton for all the help guys !!! Thanks and Regards, Ajay On Sat, Oct 24, 2015 at 2:06 AM, Steve Robenalt <sroben...@highwire.org> wrote: > Hi Ajay, > > Please take a look at the cassandra.yaml configuration reference regarding > initial_token and num_tokens: > > > http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__initial_token > > This is basically what Michael was referring to in his earlier message. > Setting an initial token overrode your num_tokens setting on initial > startup, but after initial startup, the initial token setting is ignored, > so num_tokens comes into play, attempting to start up with 256 vnodes. > That's where your error comes from. > > It's likely that all of your nodes started up like this since you have the > same config on all of them (hopefully, you at least changed initial_token > for each node). > > After reviewing the doc on the two sections above, you'll need to decide > which path to take to recover. You can likely bring the downed node up by > setting num_tokens to 1 (which you'd need to do on all nodes), in which > case you're not really running vnodes. Alternately, you can migrate the > cluster to vnodes: > > > http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configVnodesProduction_t.html > > BTW, I recommend carefully reviewing the cassandra.yaml configuration > reference for ANY change you make from the default. As you've experienced > here, not all settings are intended to work together. > > HTH, > Steve > > > > On Fri, Oct 23, 2015 at 12:07 PM, Ajay Garg <ajaygargn...@gmail.com> > wrote: > >> Any ideas, please? 
>> To repeat, we are using the exact same cassandra-version on all 4 nodes >> (2.1.10). >> >> On Fri, Oct 23, 2015 at 9:43 AM, Ajay Garg <ajaygargn...@gmail.com> >> wrote: >> >>> Hi Michael. >>> >>> Please find below the contents of cassandra.yaml for CAS11 (the files on >>> the rest of the three nodes are also exactly the same, except the >>> "initial_token" and "listen_address" fields) :: >>> >>> CAS11 :: >>> >>> >>> >>> What changes need to be made, so that whenever a downed server comes >>> back up, the missing data comes back over to it? >>> >>> Thanks and Regards, >>> Ajay >>> >>> >>> >>> On Fri, Oct 23, 2015 at 9:05 AM, Michael Shuler <mich...@pbandjelly.org> >>> wrote: >>> >>>> On 10/22/2015 10:14 PM, Ajay Garg wrote: >>>> >>>>> However, CAS11 refuses to come up now. >>>>> Following is the error in /var/log/cassandra/system.log :: >>>>> >>>>> >>>>> >>>>> ERROR [main] 2015-10-23 03:07:34,242 CassandraDaemon.java:391 - Fatal >>>>> configuration error >>>>> org.apache.cassandra.exceptions.ConfigurationException: Cannot change >>>>> the number of tokens from 1 to 256 >>>>> >>>> >>>> Check your cassandra.yaml - this node has vnodes enabled in the >>>> configuration when it did not, previously. Check all nodes. Something >>>> changed. Mixed vnode/non-vnode clusters is bad juju. >>>> >>>> -- >>>> Kind regards, >>>> Michael >>>> >>> >>> >>> >>> -- >>> Regards, >>> Ajay >>> >> >> >> >> -- >> Regards, >> Ajay >> > > > > -- > Steve Robenalt > Software Architect > sroben...@highwire.org <bza...@highwire.org> > (office/cell): 916-505-1785 > > HighWire Press, Inc. > 425 Broadway St, Redwood City, CA 94063 > www.highwire.org > > Technology for Scholarly Communication > -- Regards, Ajay
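Steve's two recovery paths come down to a small difference in cassandra.yaml. A hedged sketch of the two mutually exclusive setups (values are illustrative, not copied from the cluster above):

```yaml
# Option A: stay single-token -- one token per node, pinned explicitly.
# num_tokens: 1
# initial_token: -9223372036854775808   # must be unique per node

# Option B: run vnodes -- let Cassandra assign 256 tokens and leave
# initial_token unset so it cannot conflict on first start.
num_tokens: 256
# initial_token:
```

Whichever option is chosen has to be applied consistently on every node before restarting; mixing the two is what produces the "Cannot change the number of tokens from 1 to 256" error.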
Re: Is replication possible with already existing data?
Hi Carlos. I setup a following setup :: CAS11 and CAS12 in DC1 CAS21 and CAS22 in DC2 a) Brought all the 4 up, replication worked perfect !!! b) Thereafter, downed CAS11 via "sudo service cassandra stop". Replication continued to work fine on CAS12, CAS21 and CAS22. c) Thereafter, upped CAS11 via "sudo service cassandra start". However, CAS11 refuses to come up now. Following is the error in /var/log/cassandra/system.log :: ERROR [main] 2015-10-23 03:07:34,242 CassandraDaemon.java:391 - Fatal configuration error org.apache.cassandra.exceptions.ConfigurationException: Cannot change the number of tokens from 1 to 256 at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:966) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:734) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:611) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:387) [apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562) [apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651) [apache-cassandra-2.1.10.jar:2.1.10] INFO [StorageServiceShutdownHook] 2015-10-23 03:07:34,271 Gossiper.java:1442 - Announcing shutdown INFO [GossipStage:1] 2015-10-23 03:07:34,282 OutboundTcpConnection.java:97 - OutboundTcpConnection using coalescing strategy DISABLED ERROR [StorageServiceShutdownHook] 2015-10-23 03:07:34,305 CassandraDaemon.java:227 - Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException: null at org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1624) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1632) 
~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1686) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1510) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal(Gossiper.java:1412) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1427) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1417) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1443) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:678) ~[apache-cassandra-2.1.10.jar:2.1.10] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.10.jar:2.1.10] at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60] Ideas? Thanks and Regards, Ajay On Mon, Oct 12, 2015 at 3:46 PM, Carlos Alonso <i...@mrcalonso.com> wrote: > Yes Ajay, in your particular scenario, after all hints are delivered, both > CAS11 and CAS12 will have the exact same data. > > Cheers! > > Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso> > > On 11 October 2015 at 05:21, Ajay Garg <ajaygargn...@gmail.com> wrote: > >> Thanks a ton Anuja for the help !!! >> >> On Fri, Oct 9, 2015 at 12:38 PM, anuja jain <anujaja...@gmail.com> wrote: >> > Hi Ajay, >> > >> > >> > On Fri, Oct 9, 2015 at 9:00 AM, Ajay Garg <ajaygargn...@gmail.com> >> wrote: >> >> >> > In this case, it will be the responsibility of APP1 to start connection >> to >> > CAS12. 
On the other hand if your APP1 is connecting to cassandra using >> Java > driver, you can add multiple contact points(CAS11 and CAS12 here) so >> that if > CAS11 is down it will directly connect to CAS12. >> >> Great .. Java-driver it will be :) >> >> >> >> >> >> >> > In such a case, CAS12 will store hints for the data to be stored on >> CAS11 > (the tokens of which lies within the range of tokens CAS11 holds) and > whenever CAS11 is up again, the hints will be transferred to it and the data > will be distributed evenly.
Re: Is replication possible with already existing data?
Hi Michael. Please find below the contents of cassandra.yaml for CAS11 (the files on the rest of the three nodes are also exactly the same, except the "initial_token" and "listen_address" fields) :: CAS11 :: cluster_name: 'InstaMsg Cluster' num_tokens: 256 initial_token: -9223372036854775808 hinted_handoff_enabled: true max_hint_window_in_ms: 1080 # 3 hours hinted_handoff_throttle_in_kb: 1024 max_hints_delivery_threads: 2 batchlog_replay_throttle_in_kb: 1024 authenticator: AllowAllAuthenticator authorizer: AllowAllAuthorizer permissions_validity_in_ms: 2000 partitioner: org.apache.cassandra.dht.Murmur3Partitioner data_file_directories: - /var/lib/cassandra/data commitlog_directory: /var/lib/cassandra/commitlog disk_failure_policy: stop commit_failure_policy: stop key_cache_size_in_mb: key_cache_save_period: 14400 row_cache_size_in_mb: 0 row_cache_save_period: 0 counter_cache_size_in_mb: counter_cache_save_period: 7200 saved_caches_directory: /var/lib/cassandra/saved_caches commitlog_sync: periodic commitlog_sync_period_in_ms: 1 commitlog_segment_size_in_mb: 32 seed_provider: - class_name: org.apache.cassandra.locator.SimpleSeedProvider parameters: - seeds: "104.239.200.33,119.9.92.77" concurrent_reads: 32 concurrent_writes: 32 concurrent_counter_writes: 32 memtable_allocation_type: heap_buffers index_summary_capacity_in_mb: index_summary_resize_interval_in_minutes: 60 trickle_fsync: false trickle_fsync_interval_in_kb: 10240 storage_port: 7000 ssl_storage_port: 7001 listen_address: 104.239.200.33 start_native_transport: true native_transport_port: 9042 start_rpc: true rpc_address: localhost rpc_port: 9160 rpc_keepalive: true rpc_server_type: sync thrift_framed_transport_size_in_mb: 15 incremental_backups: false snapshot_before_compaction: false auto_snapshot: true tombstone_warn_threshold: 1000 tombstone_failure_threshold: 10 column_index_size_in_kb: 64 batch_size_warn_threshold_in_kb: 5 compaction_throughput_mb_per_sec: 16 
compaction_large_partition_warning_threshold_mb: 100 sstable_preemptive_open_interval_in_mb: 50 read_request_timeout_in_ms: 5000 range_request_timeout_in_ms: 1 write_request_timeout_in_ms: 2000 counter_write_request_timeout_in_ms: 5000 cas_contention_timeout_in_ms: 1000 truncate_request_timeout_in_ms: 6 request_timeout_in_ms: 1 cross_node_timeout: false endpoint_snitch: PropertyFileSnitch dynamic_snitch_update_interval_in_ms: 100 dynamic_snitch_reset_interval_in_ms: 60 dynamic_snitch_badness_threshold: 0.1 request_scheduler: org.apache.cassandra.scheduler.NoScheduler server_encryption_options: internode_encryption: none keystore: conf/.keystore keystore_password: cassandra truststore: conf/.truststore truststore_password: cassandra client_encryption_options: enabled: false keystore: conf/.keystore keystore_password: cassandra internode_compression: all inter_dc_tcp_nodelay: false What changes need to be made, so that whenever a downed server comes back up, the missing data comes back over to it? Thanks and Regards, Ajay On Fri, Oct 23, 2015 at 9:05 AM, Michael Shuler <mich...@pbandjelly.org> wrote: > On 10/22/2015 10:14 PM, Ajay Garg wrote: > >> However, CAS11 refuses to come up now. >> Following is the error in /var/log/cassandra/system.log :: >> >> >> >> ERROR [main] 2015-10-23 03:07:34,242 CassandraDaemon.java:391 - Fatal >> configuration error >> org.apache.cassandra.exceptions.ConfigurationException: Cannot change >> the number of tokens from 1 to 256 >> > > Check your cassandra.yaml - this node has vnodes enabled in the > configuration when it did not, previously. Check all nodes. Something > changed. Mixed vnode/non-vnode clusters is bad juju. > > -- > Kind regards, > Michael > -- Regards, Ajay
Re: Is replication possible with already existing data?
Thanks a ton Anuja for the help !!! On Fri, Oct 9, 2015 at 12:38 PM, anuja jain <anujaja...@gmail.com> wrote: > Hi Ajay, > > > On Fri, Oct 9, 2015 at 9:00 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: >> > In this case, it will be the responsibility of APP1 to start connection to > CAS12. On the other hand if your APP1 is connecting to cassandra using Java > driver, you can add multiple contact points(CAS11 and CAS12 here) so that if > CAS11 is down it will directly connect to CAS12. Great .. Java-driver it will be :) >> > In such a case, CAS12 will store hints for the data to be stored on CAS11 > (the tokens of which lies within the range of tokens CAS11 holds) and > whenever CAS11 is up again, the hints will be transferred to it and the data > will be distributed evenly. > Evenly? Should not the data be """EXACTLY""" equal after CAS11 comes back up and the sync/transfer/whatever happens? After all, before CAS11 went down, CAS11 and CAS12 were replicating all data. Once again, thanks for your help. I will be even more grateful if you would help me clear the lingering doubt to second point. Thanks and Regards, Ajay
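The driver-side failover Anuja describes (configure multiple contact points; fall through to the next node when one is down) can be sketched generically. This is not the actual DataStax driver API -- `try_connect`, `fake_connect`, and the host names are made-up stand-ins for illustration only:

```python
# Generic sketch of contact-point failover: try each configured node in
# order and use the first one that accepts a connection.

def connect_first_available(contact_points, try_connect):
    """Return a session from the first reachable contact point.

    `try_connect` is a caller-supplied function that either returns a
    session object or raises ConnectionError for a down node.
    """
    errors = {}
    for host in contact_points:
        try:
            return try_connect(host)
        except ConnectionError as exc:
            errors[host] = exc
    raise ConnectionError(f"no contact point reachable: {errors}")

# Simulated cluster from the thread: CAS11 is down, CAS12 is up.
UP = {"CAS12"}

def fake_connect(host):
    if host not in UP:
        raise ConnectionError(f"{host} is down")
    return f"session({host})"

print(connect_first_available(["CAS11", "CAS12"], fake_connect))
# -> session(CAS12)
```

Real drivers layer retry and load-balancing policies on top of this idea, but the core behaviour -- the application does not have to "detect" the dead node itself -- is the same.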
Re: Is replication possible with already existing data?
On Thu, Oct 8, 2015 at 9:47 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Thanks Eric for the reply. > > > On Thu, Oct 8, 2015 at 1:44 AM, Eric Stevens <migh...@gmail.com> wrote: >> If you're at 1 node (N=1) and RF=1 now, and you want to go N=3 RF=3, you >> ought to be able to increase RF to 3 before bootstrapping your new nodes, >> with no downtime and no loss of data (even temporary). Effective RF is >> min-bounded by N, so temporarily having RF > N ought to behave as RF = N. >> >> If you're starting at N > RF and you want to increase RF, things get >> hairier >> if you can't afford temporary consistency issues. >> > > We are ok with temporary consistency issues. > > Also, I was going through the following articles > https://10kloc.wordpress.com/2012/12/27/cassandra-chapter-5-data-replication-strategies/ > > and following doubts came up in my mind :: > > > a) > Let's say at site-1, Application-Server (APP1) uses the two > Cassandra-instances (CAS11 and CAS12), and APP1 generally uses CAS11 for all > its needs (of course, whatever happens on CAS11, the same is replicated to > CAS12 at Cassandra-level). > > Now, if CAS11 goes down, will it be the responsibility of APP1 to "detect" > this and pick up CAS12 for its needs? > Or some automatic Cassandra-magic will happen? > > > b) > In the same above scenario, let's say before CAS11 goes down, the amount of > data in both CAS11 and CAS12 was "x". > > After CAS11 goes down, the data is being put in CAS12 only. > After some time, CAS11 comes back up. > > Now, data in CAS11 is still "x", while data in CAS12 is "y" (obviously, "y" >> "x"). > > Now, will the additional ("y" - "x") data be automatically > put/replicated/whatever back in CAS11 through Cassandra? > Or it has to be done manually? > Any pointers, please ??? 
> > If there are easy recommended solutions to above, I am beginning to think > that a 2*2 (2 nodes each at 2 data-centres) will be the ideal setup > (allowing failures of entire site, or a few nodes on the same site). > > I am sorry for asking such newbie questions, and I will be grateful if these > silly questions could be answered by the experts :) > > > Thanks and Regards, > Ajay -- Regards, Ajay
Is replication possible with already existing data?
Hi All. We have a scenario, where till now we had been using a plain, simple single node, with the keyspace created using :: CREATE KEYSPACE our_db WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true; We now plan to introduce replication (in the true sense) in our scheme of things, but cannot afford to lose any data. We, however can take a bit of downtime, and do any data-migration if required (we have already done data-migration once in the past, when we moved our plain, simple single node from one physical machine to another). So, a) Is it possible at all to introduce replication in our scenario? If yes, what needs to be done to NOT LOSE our current existing data? b) Also, will "NetworkTopologyStrategy" work in our scenario (since NetworkTopologyStrategy seems to be more robust)? Brief pointers to above will give huge confidence-boosts in our endeavours. Thanks and Regards, Ajay
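For question (b): an existing SimpleStrategy keyspace can later be switched to NetworkTopologyStrategy with a single statement. A hedged sketch -- the data-center name "DC1" and the replica count are assumptions for illustration; use the names your snitch actually reports:

```sql
-- Switch the existing keyspace to NetworkTopologyStrategy.
ALTER KEYSPACE our_db
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
```

Altering replication only changes metadata, so no existing data is lost by the statement itself; the pre-existing rows still have to be streamed to the new replicas afterwards (e.g. with nodetool repair).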
Re: Is replication possible with already existing data?
Hi Sean. Thanks for the reply. On Wed, Oct 7, 2015 at 10:13 PM, <sean_r_dur...@homedepot.com> wrote: > How many nodes are you planning to add? I guess 2 more. > How many replicas do you want? 1 (original) + 2 (replicas). That makes it a total of 3 copies of every row of data. > In general, there shouldn't be a problem adding nodes and then altering the > keyspace to change replication. Great !! I guess http://docs.datastax.com/en/cql/3.0/cql/cql_reference/alter_keyspace_r.html will do the trick for changing schema-replication-details !! > You will want to run repairs to stream the data to the new replicas. Hmm.. we'll be really grateful if you could point us to a suitable link for the above step. If there is a nice-utility, we would be perfectly set up to start our fun-exercise, consisting of following steps :: a) (As advised by you) Changing the schema, to allow a replication_factor of 3. b) (As advised by you) Duplicating the already-existing-data on the other 2 nodes. c) Thereafter, let Cassandra create a total of 3 copies for every row of new-incoming-data. Once again, thanks a ton for the help !! Thanks and Regards, Ajay > You shouldn't need downtime or data migration -- this is the beauty of > Cassandra. > > > Sean Durity – Lead Cassandra Admin > > > > The information in this Internet Email is confidential and may be legally > privileged. It is intended solely for the addressee. Access to this Email by > anyone else is unauthorized. If you are not the intended recipient, any > disclosure, copying, distribution or any action taken or omitted to be taken > in reliance on it, is prohibited and may be unlawful. When addressed to our > clients any opinions or advice contained in this Email are subject to the > terms and conditions expressed in any applicable governing The Home Depot > terms of business or client engagement letter. 
The Home Depot disclaims all > responsibility and liability for the accuracy and content of this attachment > and for any damages or losses arising from any inaccuracies, errors, viruses, > e.g., worms, trojan horses, etc., or other items of a destructive nature, > which may be contained in this attachment and shall not be liable for direct, > indirect, consequential or special damages in connection with this e-mail > message or its attachment. -- Regards, Ajay
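The "run repairs" step Sean mentions is ordinarily just nodetool, run on each node in turn after the keyspace's replication has been raised. An illustrative sequence (Cassandra 2.x syntax; keyspace name taken from this thread):

```shell
# On EACH node, one node at a time, after the ALTER KEYSPACE:
nodetool repair our_db    # streams the rows this node should now replicate
nodetool status           # sanity-check ownership and load afterwards
```

Repairing one node at a time keeps the streaming load manageable on a small cluster.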
Re: Possible to restore ENTIRE data from Cassandra-Schema in one go?
Thanks Mam for the reply. I guess there is manual work needed to bring all the SSTables files into one directory, so doesn't really solve the purpose I guess. So, going the "vanilla" way might be simpler :) Thanks anyways for the help !!! Thanks and Regards, Ajay On Tue, Sep 15, 2015 at 11:34 AM, Neha Dave <nehajtriv...@gmail.com> wrote: > Haven't used it.. but u can try SSTable Bulk Loader: > > http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html > > regards > Neha > > On Tue, Sep 15, 2015 at 11:21 AM, Ajay Garg <ajaygargn...@gmail.com> wrote: >> >> Hi All. >> >> We have a schema on one Cassandra-node, and wish to duplicate the >> entire schema on another server. >> Think of this a 2 clusters, each cluster containing one node. >> >> We have found the way to dump/restore schema-metainfo at :: >> >> https://dzone.com/articles/dumpingloading-schema >> >> >> And dumping/restoring data at :: >> >> >> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html >> >> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html >> >> >> For the restoring data step, it seems that restoring every "table" >> requires a dedicated step. >> So, if the schema has 100 "tables", we would need 100 steps. >> >> >> Is it so? If yes, can the entire data be dumped/restored in one go? >> Just asking, to save time, if it could :) >> >> >> >> >> Thanks and Regards, >> Ajay > > -- Regards, Ajay
Re: Getting intermittent errors while taking snapshot
Hi All. Granting complete-permissions to the keyspace-folder (/var/lib/cassandra/data/instamsg) fixed the issue. Now, multiple, successive snapshot-commands run to completion fine. sudo chmod -R 777 /var/lib/cassandra/data/instamsg Thanks and Regards, Ajay On Tue, Sep 15, 2015 at 12:04 PM, Ajay Garg <ajaygargn...@gmail.com> wrote: > Hi All. > > Taking snapshots sometimes works, sometimes don't. > Following is the stacktrace whenever the process fails :: > > > ###### > ajay@ajay-HP-15-Notebook-PC:/var/lib/cassandra/data/instamsg$ nodetool > -h localhost -p 7199 snapshot instamsgRequested creating snapshot(s) > for [instamsg] with snapshot name [1442298538121] > error: > /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/snapshots/1442298538121/instamsg-clients-ka-15-TOC.txt > -> > /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/instamsg-clients-ka-15-TOC.txt: > Operation not permitted > -- StackTrace -- > java.nio.file.FileSystemException: > /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/snapshots/1442298538121/instamsg-clients-ka-15-TOC.txt > -> > /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/instamsg-clients-ka-15-TOC.txt: > Operation not permitted > at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at > sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) > at java.nio.file.Files.createLink(Files.java:1086) > at > org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:94) > at > org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1842) > at > org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:2279) > at > org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2361) > at > 
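For context on why permissions matter here: nodetool snapshot does not copy SSTables, it hard-links them into a snapshots/<tag>/ directory, and creating a hard link to a file the invoking user is not allowed to link fails with exactly the "Operation not permitted" seen in the trace. A blanket chmod 777 works, but matching file ownership (e.g. keeping the data directory owned by the cassandra user and running the snapshot as that user) is the narrower fix. A minimal stdlib sketch of the linking step, with made-up paths:

```python
# Sketch of what `nodetool snapshot` does per SSTable component file:
# hard-link the immutable file into snapshots/<tag>/ (no data copied).
import os
import tempfile

root = tempfile.mkdtemp()

# Stand-in for an immutable SSTable component (file name mimics the trace).
data = os.path.join(root, "instamsg-clients-ka-15-TOC.txt")
with open(data, "w") as f:
    f.write("TOC contents\n")

snap_dir = os.path.join(root, "snapshots", "1442298538121")
os.makedirs(snap_dir)

# This is the call that failed with "Operation not permitted" above; it
# succeeds only when the process is permitted to link the source file.
os.link(data, os.path.join(snap_dir, "instamsg-clients-ka-15-TOC.txt"))

# A hard link shares the inode, so the snapshot costs no extra space.
nlink = os.stat(data).st_nlink
print(nlink)  # -> 2
```

This also explains why snapshots are nearly instantaneous regardless of data size.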
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2355) > at org.apache.cassandra.db.Keyspace.snapshot(Keyspace.java:207) > at > org.apache.cassandra.service.StorageService.takeSnapshot(StorageService.java:2388) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46) > at > com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237) > at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138) > at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819) > at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801) > at > javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466) > at > javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76) > at > javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307) > at > javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399) > at > 
javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323) >
Getting intermittent errors while taking snapshot
Hi All. Taking snapshots sometimes works, sometimes don't. Following is the stacktrace whenever the process fails :: ## ajay@ajay-HP-15-Notebook-PC:/var/lib/cassandra/data/instamsg$ nodetool -h localhost -p 7199 snapshot instamsgRequested creating snapshot(s) for [instamsg] with snapshot name [1442298538121] error: /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/snapshots/1442298538121/instamsg-clients-ka-15-TOC.txt -> /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/instamsg-clients-ka-15-TOC.txt: Operation not permitted -- StackTrace -- java.nio.file.FileSystemException: /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/snapshots/1442298538121/instamsg-clients-ka-15-TOC.txt -> /var/lib/cassandra/data/instamsg/clients-b32f01b02eec11e5866887c3880d7c45/instamsg-clients-ka-15-TOC.txt: Operation not permitted at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) at java.nio.file.Files.createLink(Files.java:1086) at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:94) at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1842) at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:2279) at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2361) at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2355) at org.apache.cassandra.db.Keyspace.snapshot(Keyspace.java:207) at org.apache.cassandra.service.StorageService.takeSnapshot(StorageService.java:2388) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:497) at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466) at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323) at sun.rmi.transport.Transport$1.run(Transport.java:200) at sun.rmi.transport.Transport$1.run(Transport.java:197) at 
java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:196) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$251(TCPTransport.java:683) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$1/13812661.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav
Re: Not able to cqlsh on 2.1.9 on Ubuntu 14.04
Hi All. Thanks for your replies. a) cqlsh does not work either :( b) Following are the parameters as asked :: listen_address: localhost rpc_address: localhost broadcast_rpc_address is not set. According to the yaml file :: # RPC address to broadcast to drivers and other Cassandra nodes. This cannot # be set to 0.0.0.0. If left blank, this will be set to the value of # rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must # be set. # broadcast_rpc_address: 1.2.3.4 c) Following is the netstat-output, with process information :: ### ajay@comp:~$ sudo netstat -apn | grep 9042 [sudo] password for admin: tcp6 0 0 127.0.0.1:9042 :::* LISTEN 10169/java ### Kindly let me know what else we can try .. it is really driving us nuttsss :( On Mon, Sep 14, 2015 at 9:40 PM, Jared Biel <jared.b...@bolderthinking.com> wrote: > Whoops, I accidentally pressed a hotkey and sent my message prematurely. > Here's what netstat should look like with those settings: > > sudo netstat -apn | grep 9042 > tcp6 0 0 0.0.0.0:9042:::*LISTEN > 21248/java > > -Jared > > On 14 September 2015 at 16:09, Jared Biel <jared.b...@bolderthinking.com> > wrote: >> >> I assume "@ Of node" is ethX's IP address? Has cassandra been restarted >> since changes were made to cassandra.yaml? The netstat output that you >> posted doesn't look right; we use settings similar to what you've posted. >> Here's what it looks like on one of our nodes. >> >> >> -Jared >> >> On 14 September 2015 at 10:34, Ahmed Eljami <ahmed.elj...@gmail.com> >> wrote: >>> >>> In cassanrda.yaml: >>> listen_address:@ Of node >>> rpc_address:0.0.0.0 >>> >>> brodcast_rpc_address:@ Of node >>> >>> 2015-09-14 11:31 GMT+01:00 Neha Dave <nehajtriv...@gmail.com>: >>>> >>>> Try >>>> >cqlsh >>>> >>>> regards >>>> Neha >>>> >>>> On Mon, Sep 14, 2015 at 3:53 PM, Ajay Garg <ajaygargn...@gmail.com> >>>> wrote: >>>>> >>>>> Hi All. 
>>>>> >>>>> We have setup a Ubuntu-14.04 server, and followed the steps exactly as >>>>> per http://wiki.apache.org/cassandra/DebianPackaging >>>>> >>>>> Installation completes fine, Cassandra starts fine, however cqlsh does >>>>> not work. >>>>> We get the error :: >>>>> >>>>> >>>>> ### >>>>> ajay@comp:~$ cqlsh >>>>> Connection error: ('Unable to connect to any servers', {'127.0.0.1': >>>>> error(None, "Tried connecting to [('127.0.0.1', 9042)]. Last error: >>>>> None")}) >>>>> >>>>> ### >>>>> >>>>> >>>>> >>>>> Version-Info :: >>>>> >>>>> >>>>> ### >>>>> ajay@comp:~$ dpkg -l | grep cassandra >>>>> ii cassandra 2.1.9 >>>>> all distributed storage system for structured data >>>>> >>>>> ### >>>>> >>>>> >>>>> >>>>> The port "seems" to be opened fine. >>>>> >>>>> >>>>> ### >>>>> ajay@comp:~$ netstat -an | grep 9042 >>>>> tcp6 0 0 127.0.0.1:9042 :::* >>>>> LISTEN >>>>> >>>>> ### >>>>> >>>>> >>>>> >>>>> Firewall-filters :: >>>>> >>>>> >>>>> ### >>>>> ajay@comp:~$ sudo
Re: Not able to cqlsh on 2.1.9 on Ubuntu 14.04
Hi All.

I re-established my server from scratch, and installed the 21x server.
Now, cqlsh works right out of the box.

When I had last set up the server, I had (accidentally) installed the 20x
server on the first attempt, removed it, and then installed the 21x series
server. It seems that caused some hidden problem.

I am really grateful to everyone for bearing with me.

Thanks and Regards,
Ajay

On Tue, Sep 15, 2015 at 10:16 AM, Ajay Garg <ajaygargn...@gmail.com> wrote:
> Hi Jared.
>
> Thanks for your help.
>
> I made the config-changes.
> Also, I changed the seed (right now, we are just trying to get one
> instance up and running) ::
>
> seed_provider:
>     # Addresses of hosts that are deemed contact points.
>     # Cassandra nodes use this list of hosts to find each other and learn
>     # the topology of the ring. You must change this if you are running
>     # multiple nodes!
>     - class_name: org.apache.cassandra.locator.SimpleSeedProvider
>       parameters:
>           # seeds is actually a comma-delimited list of addresses.
>           # Ex: ",,"
>           - seeds: "our.ip.address.here"
>
> Following is the netstat output ::
>
> ####
> ajay@comp:~$ sudo netstat -apn | grep 9042
> tcp6       0      0 0.0.0.0:9042      :::*      LISTEN      22469/java
> ####
>
> Still, when I try, we get ::
>
> ####
> ajay@comp:~$ cqlsh our.ip.address.here
> Connection error: ('Unable to connect to any servers',
> {'our.ip.address.here': error(None, "Tried connecting to
> [('our.ip.address.here', 9042)]. Last error: None")})
> ####
>
> :( :(
>
> On Mon, Sep 14, 2015 at 11:00 PM, Jared Biel
> <jared.b...@bolderthinking.com> wrote:
>> Is there a reason that you're setting listen_address and rpc_address to
>> localhost?
>>
>> listen_address doc: "the Right Thing is to use the address associated with
>> the hostname". So, set the IP address of this to eth0 for example. I believe
>> if it is set to localhost then you won't be able to form a cluster with
>> other nodes.
>>
>> rpc_address: this is the address to which clients will connect.
>> I recommend 0.0.0.0 here so clients can connect to the IP address of the
>> server as well as localhost if they happen to reside on the same instance.
>>
>> Here are all of the address settings from our config file. 192.168.1.10 is
>> the IP address of eth0 and broadcast_address is commented out.
>>
>> listen_address: 192.168.1.10
>> # broadcast_address: 1.2.3.4
>> rpc_address: 0.0.0.0
>> broadcast_rpc_address: 192.168.1.10
>>
>> Follow these directions to get up and running with the first node
>> (destructive process):
>>
>> 1. Stop cassandra
>> 2. Remove data from cassandra var directory (rm -rf /var/lib/cassandra/*)
>> 3. Make above changes to config file. Also set seeds to the eth0 IP address
>> 4. Start cassandra
>> 5. Set seeds in config file back to "" after cassandra is up and running.
>>
>> After following that process, you'll be able to connect to the node from any
>> host that can reach Cassandra's ports on that node (the "cqlsh" command will
>> work). To join more nodes to the cluster, follow the same steps as above,
>> except set the seeds value to the IP address of an already running node.
>>
>> Regarding the empty "seeds" config entry: our configs are automated with
>> configuration management. During the node bootstrap process a script
>> performs the above. The reason that we set seeds back to empty is that we
>> don't want nodes coming up/down to cause the config file to change and thus
>> cassandra to restart needlessly. So far we haven't had any issues with seeds
>> being set to empty after a node has joined the cluster, but this may not be
>> the recommended way of doing things.
>>
>> -Jared
>>
>> On 14 September 2015 at 16:46, Ajay Garg <ajaygargn...@gmail.com> wrote:
>>>
>>> Hi All.
>>>
>>> Thanks for your replies.
>>>
>>> a) cqlsh does not work either :(
>>>
>>> b) Following are the parameters as asked ::
>>>
>>> listen_address: localhost >&
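The address changes Jared describes can be scripted rather than hand-edited. A sketch, assuming GNU sed and working on a scratch copy of the file; the 192.168.1.10 address comes from his example, and the yaml content here is just the three relevant keys (point the real thing at /etc/cassandra/cassandra.yaml and the node's eth0 address):

```shell
# Work on a scratch copy shaped like the relevant cassandra.yaml keys.
YAML=$(mktemp)
cat > "$YAML" <<'EOF'
listen_address: localhost
rpc_address: localhost
# broadcast_rpc_address: 1.2.3.4
EOF

NODE_IP=192.168.1.10   # substitute the node's eth0 address
sed -i \
  -e "s/^listen_address:.*/listen_address: ${NODE_IP}/" \
  -e "s/^rpc_address:.*/rpc_address: 0.0.0.0/" \
  -e "s/^# broadcast_rpc_address:.*/broadcast_rpc_address: ${NODE_IP}/" \
  "$YAML"

# Show the resulting address settings.
grep -E '^(listen_address|rpc_address|broadcast_rpc_address):' "$YAML"
```

Cassandra reads the file only at startup, so a restart is still needed after the edit.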
Re: Not able to cqlsh on 2.1.9 on Ubuntu 14.04
Hi Jared. Thanks for your help. I made the config-changes. Also, I changed the seed (right now, we are just trying to get one instance up and running) :: seed_provider: # Addresses of hosts that are deemed contact points. # Cassandra nodes use this list of hosts to find each other and learn # the topology of the ring. You must change this if you are running # multiple nodes! - class_name: org.apache.cassandra.locator.SimpleSeedProvider parameters: # seeds is actually a comma-delimited list of addresses. # Ex: ",," - seeds: "our.ip.address.here" Following is the netstat output :: #### ajay@comp:~$ sudo netstat -apn | grep 9042 tcp6 0 0 0.0.0.0:9042:::* LISTEN 22469/java Still, when I try, we get :: #### ajay@comp:~$ cqlsh our.ip.address.here Connection error: ('Unable to connect to any servers', {'our.ip.address.here': error(None, "Tried connecting to [('our.ip.address.here', 9042)]. Last error: None")}) :( :( On Mon, Sep 14, 2015 at 11:00 PM, Jared Biel <jared.b...@bolderthinking.com> wrote: > Is there a reason that you're setting listen_address and rpc_address to > localhost? > > listen_address doc: "the Right Thing is to use the address associated with > the hostname". So, set the IP address of this to eth0 for example. I believe > if it is set to localhost then you won't be able to form a cluster with > other nodes. > > rpc_address: this is the address to which clients will connect. I recommend > 0.0.0.0 here so clients can connect to IP address of the server as well as > localhost if they happen to reside on the same instance. > > > Here are all of the address settings from our config file. 192.168.1.10 is > the IP address of eth0 and broadcast_address is commented out. > > listen_address: 192.168.1.10 > # broadcast_address: 1.2.3.4 > rpc_address: 0.0.0.0 > broadcast_rpc_address: 192.168.1.10 > > Follow these directions to get up and running with the first node > (destructive process): > > 1. Stop cassandra > 2. 
Remove data from cassandra var directory (rm -rf /var/lib/cassandra/*) > 3. Make above changes to config file. Also set seeds to the eth0 IP address > 4. Start cassandra > 5. Set seeds in config file back to "" after cassandra is up and running. > > After following that process, you'll be able to connect to the node from any > host that can reach Cassandra's ports on that node ("cqlsh" command will > work.) To join more nodes to the cluster, follow the steps same steps as > above, except the seeds value to the IP address of an already running node. > > Regarding the empty "seeds" config entry: our configs are automated with > configuration management. During the node bootstrap process a script > performs the above. The reason that we set seeds back to empty is that we > don't want nodes coming up/down to cause the config file to change and thus > cassandra to restart needlessly. So far we haven't had any issues with seeds > being set to empty after a node has joined the cluster, but this may not be > the recommended way of doing things. > > -Jared > > On 14 September 2015 at 16:46, Ajay Garg <ajaygargn...@gmail.com> wrote: >> >> Hi All. >> >> Thanks for your replies. >> >> a) >> cqlsh does not work either :( >> >> >> b) >> Following are the parameters as asked :: >> >> listen_address: localhost >> rpc_address: localhost >> >> broadcast_rpc_address is not set. >> According to the yaml file :: >> >> # RPC address to broadcast to drivers and other Cassandra nodes. This >> cannot >> # be set to 0.0.0.0. If left blank, this will be set to the value of >> # rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address >> must >> # be set. >> # broadcast_rpc_address: 1.2.3.4 >> >> >> c) >> Following is the netstat-output, with process information :: >> >> >> ### >> ajay@comp:~$ sudo netstat -apn | grep 9042 >> [sudo] password for admin: >> tcp6 0 0 127.0.0.1:9042 :::* >> LISTEN 10169/java >> >>
Possible to restore ENTIRE data from Cassandra-Schema in one go?
Hi All.

We have a schema on one Cassandra node, and wish to duplicate the entire
schema on another server. Think of this as 2 clusters, each cluster
containing one node.

We have found the way to dump/restore the schema meta-info at ::
https://dzone.com/articles/dumpingloading-schema

And dumping/restoring data at ::
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html

For the restoring-data step, it seems that restoring every "table" requires
a dedicated step. So, if the schema has 100 "tables", we would need 100
steps. Is that so? If so, can the entire data be dumped/restored in one go?

Just asking, to save time, if it could :)

Thanks and Regards,
Ajay
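There is no single built-in "restore everything" command in this era of Cassandra; the usual workaround is to loop the per-table step over the snapshot layout. A sketch that only prints the sstableloader invocations rather than running them (the directory names and `target.node.address` are fabricated; a real snapshot lives under /var/lib/cassandra/data/&lt;keyspace&gt;/&lt;table&gt;/snapshots/&lt;tag&gt;/):

```shell
# Fabricated snapshot layout for illustration.
SNAP_ROOT=$(mktemp -d)
mkdir -p "$SNAP_ROOT/myks/users/snapshots/bk1" \
         "$SNAP_ROOT/myks/events/snapshots/bk1"

# Dry run: print one sstableloader command per table instead of executing
# (a real run needs a reachable cluster; target.node.address is a placeholder).
for dir in "$SNAP_ROOT"/*/*/snapshots/bk1; do
  ks=$(basename "$(dirname "$(dirname "$(dirname "$dir")")")")
  tbl=$(basename "$(dirname "$(dirname "$dir")")")
  echo "sstableloader -d target.node.address $dir   # $ks.$tbl"
done
```

Dropping the `echo` turns the dry run into the actual per-table restore, which at least reduces the "100 tables, 100 steps" problem to one loop.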
Test Subject
Testing simple content, as my previous email bounced :(

--
Regards,
Ajay
Not able to cqlsh on 2.1.9 on Ubuntu 14.04
Hi All.

We have set up an Ubuntu 14.04 server, and followed the steps exactly as
per http://wiki.apache.org/cassandra/DebianPackaging

Installation completes fine, Cassandra starts fine, however cqlsh does not
work. We get the error ::

###
ajay@comp:~$ cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1':
error(None, "Tried connecting to [('127.0.0.1', 9042)]. Last error:
None")})
###

Version-Info ::

###
ajay@comp:~$ dpkg -l | grep cassandra
ii  cassandra  2.1.9  all  distributed storage system for structured data
###

The port "seems" to be opened fine.

###
ajay@comp:~$ netstat -an | grep 9042
tcp6       0      0 127.0.0.1:9042      :::*      LISTEN
###

Firewall-filters ::

###
ajay@comp:~$ sudo iptables -L
[sudo] password for ajay:
Chain INPUT (policy ACCEPT)
target     prot opt source       destination
ACCEPT     all  --  anywhere     anywhere      state RELATED,ESTABLISHED
ACCEPT     tcp  --  anywhere     anywhere      tcp dpt:ssh
DROP       all  --  anywhere     anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source       destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source       destination
###

Even telnet fails :(

###
ajay@comp:~$ telnet localhost 9042
Trying 127.0.0.1...
###

Any ideas please?? We have been stuck on this for a good 3 hours now :(

Thanks and Regards,
Ajay
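One detail worth noting in the pasted iptables output: the INPUT chain ends with a blanket DROP rule, which matches loopback traffic too, and that alone would make both cqlsh and telnet to 127.0.0.1:9042 hang exactly as shown. A sketch of rules one might insert ahead of the DROP; they are printed rather than applied, since applying them needs root and should match the site's firewall policy:

```shell
# Candidate rules: accept loopback and the CQL native port (9042) before
# the blanket DROP. Printed only; apply by hand (as root) if appropriate.
RULES=$(cat <<'EOF'
iptables -I INPUT 1 -i lo -j ACCEPT
iptables -I INPUT 2 -p tcp --dport 9042 -m state --state NEW -j ACCEPT
EOF
)
echo "$RULES"
```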
Re: Can't connect to Cassandra server
Try with the correct IP address as below: cqlsh 192.248.15.219 -u sinmin -p xx CQL documentation - http://docs.datastax.com/en/cql/3.0/cql/cql_reference/cqlsh.html On Sun, Jul 19, 2015 at 2:00 PM, Chamila Wijayarathna cdwijayarat...@gmail.com wrote: Hello all, After starting cassandra, I tried to connect to cassandra from cqlsh and java, but it fails to do so. Following is the error I get while trying to connect to cqlsh. cqlsh -u sinmin -p xx Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused)}) I have set listen_address and rpc_address in cassandra.yaml to the ip address of server address like follows. listen_address:192.248.15.219 rpc_address:192.248.15.219 Following is what I found from cassandra system.log. https://gist.githubusercontent.com/cdwijayarathna/a14586a9e39a943f89a0/raw/system%20log Following is the netstat result I got. maduranga@ubuntu:/var/log/cassandra$ netstat Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp0 0 ubuntu:ssh 103.21.166.35:54417 ESTABLISHED tcp0 0 ubuntu:1522 ubuntu:30820 ESTABLISHED tcp0 0 ubuntu:30820ubuntu:1522 ESTABLISHED tcp0256 ubuntu:ssh 175.157.41.209:42435 ESTABLISHED Active UNIX domain sockets (w/o servers) Proto RefCnt Flags Type State I-Node Path unix 9 [ ] DGRAM7936 /dev/log unix 3 [ ] STREAM CONNECTED 11737 unix 3 [ ] STREAM CONNECTED 11736 unix 3 [ ] STREAM CONNECTED 10949 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 10948 unix 2 [ ] DGRAM10947 unix 2 [ ] STREAM CONNECTED 10801 unix 3 [ ] STREAM CONNECTED 10641 unix 3 [ ] STREAM CONNECTED 10640 unix 3 [ ] STREAM CONNECTED 10444 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 10443 unix 3 [ ] STREAM CONNECTED 10437 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 10436 unix 3 [ ] STREAM CONNECTED 10430 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 10429 unix 2 [ ] 
DGRAM10424 unix 3 [ ] STREAM CONNECTED 10422 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 10421 unix 2 [ ] DGRAM10420 unix 2 [ ] STREAM CONNECTED 10215 unix 2 [ ] STREAM CONNECTED 10296 unix 2 [ ] STREAM CONNECTED 9988 unix 2 [ ] DGRAM9520 unix 3 [ ] STREAM CONNECTED 8769 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 8768 unix 2 [ ] DGRAM8753 unix 2 [ ] DGRAM9422 unix 3 [ ] STREAM CONNECTED 7000 @/com/ubuntu/upstart unix 3 [ ] STREAM CONNECTED 8485 unix 2 [ ] DGRAM7947 unix 3 [ ] STREAM CONNECTED 6712 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 6711 unix 3 [ ] STREAM CONNECTED 7760 /var/run/dbus/system_bus_socket unix 3 [ ] STREAM CONNECTED 7759 unix 3 [ ] STREAM CONNECTED 7754 unix 3 [ ] STREAM CONNECTED 7753 unix 3 [ ] DGRAM7661 unix 3 [ ] DGRAM7660 unix 3 [ ] STREAM CONNECTED 6490 @/com/ubuntu/upstart unix 3 [ ] STREAM CONNECTED 6475 What is the issue here? Why I can't connect to Cassandra server? How can I fix this? Thank You! -- *Chamila Dilshan Wijayarathna,* Software Engineer Mobile:(+94)788193620 WSO2 Inc., http://wso2.com/
Re: Cassandra counters
Any pointers on this?

In 2.1, is updating a counter via an UNLOGGED batch with a timestamp as
safe as other column updates at a given consistency level (i.e., can a
counter update be made idempotent with a timestamp)?

Thanks
Ajay

On 09-Jul-2015 11:47 am, Ajay ajay.ga...@gmail.com wrote:

Hi,

What is the accuracy improvement of counters in 2.1 over 2.0?

The below post mentions 2.0.x issues fixed in 2.1, and performance
improvements:
http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters

But how accurate are counters in 2.1.x, and are there any known issues in
2.1 when using an UNLOGGED batch for counter updates with a timestamp?

Thanks
Ajay
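For context on why timestamps don't help here: counters have their own update form, and (to my knowledge) Cassandra rejects a client-supplied `USING TIMESTAMP` on counter updates, since the server assigns counter timestamps internally. A hypothetical example (table and column names made up):

```sql
-- Hypothetical counter table; counter columns live in dedicated tables.
CREATE TABLE page_counts (
    page text PRIMARY KEY,
    hits counter
);

-- Increments are deltas, not absolute writes: if this times out and the
-- client retries, the increment may be applied twice. Counter updates
-- are therefore not idempotent, batched or not.
UPDATE page_counts SET hits = hits + 1 WHERE page = 'home';
```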
Cassandra counters
Hi,

What is the accuracy improvement of counters in 2.1 over 2.0?

The below post mentions 2.0.x issues fixed in 2.1, and performance
improvements:
http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters

But how accurate are counters in 2.1.x, and are there any known issues in
2.1 when using an UNLOGGED batch for counter updates?

Thanks
Ajay
Re: Hbase vs Cassandra
started is easier with Cassandra. For HBase you need to run HDFS and
Zookeeper, etc.
* I've heard lots of anecdotes about Cassandra working nicely with small
clusters (< 50 nodes) and quickly degenerating above that.
* HBase does not have a query language (but you can use Phoenix for full
SQL support)
* HBase does not have secondary indexes (having an eventually consistent
index, similar to what Cassandra has, is easy in HBase, but making it as
consistent as the rest of HBase is hard)

Thanks
Ajay

On May 29, 2015, at 12:09 PM, Ajay ajay.ga...@gmail.com wrote:

Hi,

I need some info on HBase vs Cassandra as a data store (in general, plus
specific to time-series data). A comparison of the following helps:

1: features
2: deployment and monitoring
3: performance
4: anything else

Thanks
Ajay
Re: Hbase vs Cassandra
Hi Jens, All the points listed weren't from me. I posted the HBase Vs Cassandra in both the forums and consolidated here for the discussion. On Mon, Jun 8, 2015 at 2:27 PM, Jens Rantil jens.ran...@tink.se wrote: Hi, Some minor comments: 2.terrible!!! Ambari/cloudera manager rulezzz. Netflix has its own tool for Cassandra but it doesn't support vnodes. Not entirely sure what you mean here, but we ran Cloudera for a while and Cloudera Manager was buggy and hard to debug. Overall, our experience wasn't very good. This was definitely also due to us not knowing how all the Cloudera packages were configured. * This is the one of the response I got it from HBase forum. Datastax OpsCenter is there but seems it doesn't support the latest Cassandra versions (we tried it couple of times and there were bugs too)* HBase is always consistent. Machine outages lead to inability to read or write data on that machine. With Cassandra you can always write. Sort of true. You can decide write consistency and throw an exception if write didn't go through consistently. However, do note that Cassandra will never rollback failed writes which means writes aren't atomic (as in ACID). * If I understand correctly, you mean when we write with QUORUM and Cassandra writes to few machines and fails to write to few machines and throws exception if it doesn't satisfy QUORUM, leaving it inconsistent and doesn't rollback?. * We chose Cassandra over HBase mostly due to ease of managability. We are a small team, and my feeling is that you will want dedicated people taking care of a Hadoop cluster if you are going down the HBase path. A Cassandra cluster can be handled by a single engineer and is, in my opinion, easier to maintain. * This is the most popular reason for Cassandra over HBase. But this alone is not a sufficient driver. * Cheers, Jens On Mon, Jun 8, 2015 at 9:59 AM, Ajay ajay.ga...@gmail.com wrote: Hi All, Thanks for all the input. 
I posted the same question in HBase forum and got more response. Posting the consolidated list here. Our case is that a central team builds and maintain the platform (Cassandra as a service). We have couple of usecases which fits Cassandra like time-series data. But as a platform team, we need to know more features and usecases which fits or best handled in Cassandra. Also to understand the usecases where HBase performs better (we might need to have it as a service too). *Cassandra:* 1) From 2013 both can still be relevant: http://www.pythian.com/blog/watch-hbase-vs-cassandra/ 2) Here are some use cases from PlanetCassandra.org of companies who chose Cassandra over HBase after evaluation, or migrated to Cassandra from HBase. The eComNext interview cited on the page touches on time-series data; http://planetcassandra.org/hbase-to-cassandra-migration/ 3) From googling, the most popular advantages for Cassandra over HBase is easy to deploy, maintain monitor and no single point of failure. 4) From our six months research and POC experience in Cassandra, CQL is pretty limited. Though CQL is targeted for Real time Read and Write, there are cases where need to pull out data differently and we are OK with little more latency. But Cassandra doesn't support that. We need MapReduce or Spark for those. Then the debate starts why Cassandra and why not HBase if we need Hadoop/Spark for MapReduce. Expected a few more technical features/usecases that is best handled by Cassandra (and how it works). *HBase:* 1) As for the #4 you might be interested in reading https://aphyr.com/posts/294-call-me-maybe-cassandra Not sure if there is comparable article about HBase (anybody knows?) but it can give you another perspective about what else to keep an eye on regarding these systems. 2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe 3) http://blog.parsely.com/post/1928/cass/ *Anyone have any comments on this?* 4) 1. No killer features comparing to hbase 2.terrible!!! 
Ambari/Cloudera Manager rulezzz. Netflix has its own tool for Cassandra
but it doesn't support vnodes.
3. Rumors say it is fast when it works ;) The reason: it can silently drop
data you try to write.
4. Timeseries is a nightmare. The easiest approach is just to replicate
data to HDFS, partition it by hour/day and run Spark/Scalding/Pig/Hive/Impala.

5) Migrated from Cassandra to HBase. Reasons:
- Scan is fast with HBase. It fits better with the time-series data model.
Please look at OpenTSDB. Cassandra models it with large rows.
- Server-side filtering. You can use it to filter some of your time-series
data on the server side.
- HBase has better integration with Hadoop in general. We had to write our
own bulk loader using MapReduce for Cassandra; HBase already has a tool
for that. There is a nice integration with Flume and Kite.
- High availability didn't matter for us. 10 secs down is fine for our use
cases. HBase started to support eventually
Hbase vs Cassandra
Hi, I need some info on Hbase vs Cassandra as a data store (in general plus specific to time series data). The comparison in the following helps: 1: features 2: deployment and monitoring 3: performance 4: anything else Thanks Ajay
Re: Caching the PreparedStatement (Java driver)
Hi Joseph,

The Java driver currently caches prepared statements, but using a weak
reference, i.e. the cache will hold one only as long as the client code
uses it. That in turn means that we need to cache them ourselves.

But I am also not sure what happens when a cached prepared statement is
executed after the cassandra nodes restart. Is the server's
prepared-statements cache persisted, or in memory? If it is in memory, how
do we handle stale prepared statements in the cache?

Thanks
Ajay

On Fri, May 15, 2015 at 6:28 PM, ja jaa...@gmail.com wrote:

Hi,

Isn't it a good-to-have feature for the Java driver to maintain a cache of
PreparedStatements (PS)? Any reason why it's left to the application to do
the same? I am currently implementing a cache of PS that is loaded at app
startup, but how do I ensure this cache is always good to use? Say there's
a restart on the Cassandra server side; this cache would be stale, and I
assume the next use of a PS from the cache would fail. Any way to recover
from this?

Thanks,
Joseph

On Sunday, March 1, 2015 at 12:46:14 AM UTC+5:30, Vishy Kasar wrote:

On Feb 28, 2015, at 4:25 AM, Ajay ajay@gmail.com wrote:

Hi,

My earlier question was whether it is safe to cache PreparedStatement
(using the Java driver) on the client side, for which I got it confirmed
by Olivier.

Now the question is, do we really need to cache the PreparedStatement on
the client side? Let's take a scenario as below:

1) Client fires a REST query SELECT * from Test where Pk = val1;
2) REST service prepares a statement SELECT * from Test where Pk = ?
3) Executes the PreparedStatement by setting the values.
4) Assume we don't cache the PreparedStatement
5) Client fires another REST query SELECT * from Test where Pk = val2;
6) REST service prepares a statement SELECT * from Test where Pk = ?
7) Executes the PreparedStatement by setting the values.

You should avoid re-preparing the statement (step 6 above). When you
create a prepared statement, a round trip to the server is involved.
So you should create it once and reuse it. You can bind it with different values and execute the bound statement each time. In this case, is there any benefit of using the PreparedStatement? From the Java driver code, the Session.prepare(query) doesn't check whether a similar query was prepared earlier or not. It directly call the server passing the query. The return from the server is a PreparedId. Do the server maintains a cache of Prepared queries or it still perform the all the steps to prepare a query if the client calls to prepare the same query more than once (using the same Session and Cluster instance which I think doesn't matter)?. Thanks Ajay On Sat, Feb 28, 2015 at 9:17 AM, Ajay ajay@gmail.com wrote: Thanks Olivier. Most of the REST query calls would come from other applications to write/read to/from Cassandra which means most queries from an application would be same (same column families but different values). Thanks Ajay On 28-Feb-2015 6:05 am, Olivier Michallat olivier@datastax.com wrote: Hi Ajay, Yes, it is safe to hold a reference to PreparedStatement instances in your client code. If you always run the same pre-defined statements, you can store them as fields in your resource classes. If your statements are dynamically generated (for example, inserting different subsets of the columns depending on what was provided in the REST payload), your caching approach is valid. When you evict a PreparedStatement from your cache, the driver will also remove the corresponding id from its internal cache. If you re-prepare it later it might still be in the Cassandra-side cache, but that is not a problem. One caveat: you should be reasonably confident that your prepared statements will be reused. If your query strings are always different, preparing will bring no advantage. -- Olivier Michallat Driver tools engineer, DataStax On Fri, Feb 27, 2015 at 7:04 PM, Ajay ajay@gmail.com wrote: Hi, We are building REST APIs for Cassandra using the Cassandra Java Driver. 
So as per the below guidelines from the documentation, we are caching the
Cluster instance (per cluster) and the Session instance (per keyspace), as
they are thread-safe.
http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/fourSimpleRules.html

As the Cluster and Session instance(s) are cached in the application
already, and also as PreparedStatements provide better performance, we
thought to build the PreparedStatement for a REST query implicitly (as
REST calls are stateless) and cache the PreparedStatement. Whenever a REST
query is invoked, we look for a PreparedStatement in the cache, and create
and put it in the cache if it doesn't exist. (The cache is in-memory,
fixed-size and LRU-based.)

Is it a safe approach to cache PreparedStatement on the client side?
Looking at the Java driver code, the Cluster class stores the
PreparedStatements
Re: Hive support on Cassandra
Thanks everyone. Basically we are looking at Hive because it supports advanced queries (CQL is limited to the data model). Does Stratio supports similar to Hive? Thanks Ajay On Thu, May 7, 2015 at 10:33 PM, Andres de la Peña adelap...@stratio.com wrote: You may also find interesting https://github.com/Stratio/crossdata. This project provides batch and streaming capabilities for Cassandra and others databases though a SQL-like language. Disclaimer: I am an employee of Stratio 2015-05-07 17:29 GMT+02:00 l...@airstreamcomm.net: You might also look at Apache Drill, which has support (I think alpha) for ANSI SQL queries against Cassandra if that would suit your needs. On May 6, 2015, at 12:57 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Does Apache Cassandra (not DSE) support Hive Integration? I found couple of open source efforts but nothing is available currently. Thanks Ajay -- Andrés de la Peña http://www.stratio.com/ Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón, Madrid Tel: +34 91 352 59 42 // *@stratiobd https://twitter.com/StratioBD*
Hive support on Cassandra
Hi, Does Apache Cassandra (not DSE) support Hive Integration? I found couple of open source efforts but nothing is available currently. Thanks Ajay
When to use STCS/DTCS/LCS
Hi,

What are the guidelines on when to use STCS/DTCS/LCS? The most reliable
way is to test with each of them and find the best fit, but are there
guidelines or best practices (out of experience) on which one to use when?

Thanks
Ajay
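For reference, whichever strategy is chosen, it is configured per table. A hypothetical example in 2.1-era CQL (keyspace/table names made up, option values illustrative):

```sql
-- Time-series style table switched to DateTieredCompactionStrategy (DTCS).
ALTER TABLE metrics.events
  WITH compaction = {'class': 'DateTieredCompactionStrategy',
                     'base_time_seconds': 3600};

-- Or LeveledCompactionStrategy (LCS), often favoured for read-heavy,
-- overwrite-heavy workloads at the cost of more compaction I/O:
-- ALTER TABLE metrics.events
--   WITH compaction = {'class': 'LeveledCompactionStrategy',
--                      'sstable_size_in_mb': 160};
```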
Re: Availability testing of Cassandra nodes
Adding the Java driver forum. Even we would like to know more on this.

- Ajay

On Wed, Apr 8, 2015 at 8:15 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Just a couple of quick comments:

1. The driver is supposed to be doing availability and load balancing
already.
2. If your cluster is lightly loaded, it isn't necessary to be so precise
with load balancing.
3. If your cluster is heavily loaded, it won't help. The solution is to
expand your cluster so that precise balancing of requests (beyond what the
driver does) is not required.

Is there anything special about your use case that you feel is worth the
extra treatment? If you are having problems with the driver balancing
requests and properly detecting available nodes, or see some room for
improvement, make sure to file the issues so that they can be fixed.

-- Jack Krupansky

On Wed, Apr 8, 2015 at 10:31 AM, Jiri Horky ho...@avast.com wrote:

Hi all,

we are thinking of how to best proceed with availability testing of
Cassandra nodes. It is becoming more and more apparent that it is a rather
complex task. We thought that we should try to read and write to each
cassandra node to a monitoring keyspace with a unique value with a low
TTL. This helps to find an issue, but it also triggers flapping of
unaffected hosts, as the key of the value which is being inserted
sometimes belongs to an affected host and sometimes not.

Now, we could calculate the right value to insert so we can be sure it
will hit the host we are connecting to, but then you have replication
factor and consistency level, so you cannot really be sure that it
actually tests the ability of the given host to write values.

So we ended up thinking that the best approach is to connect to each
individual host, read some system keyspace (which might be on a different
disk drive...), which should be local, and then check several JMX values
that could indicate an error, plus JVM statistics (full heap, GC overhead).
Moreover, we will also monitor our applications that are using cassandra
(mostly with the DataStax driver) and try to get failed-node information
from them.

How do others do the testing?

Jirka H.
Re: Stable cassandra build for production usage
Hi,

Now that 2.0.13 is out, I don't see the nodetool cleanup issue
(https://issues.apache.org/jira/browse/CASSANDRA-8718) fixed yet. The bug
shows priority Minor. Is anybody facing this issue?

Thanks
Ajay

On Thu, Mar 12, 2015 at 11:41 PM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Mar 12, 2015 at 10:50 AM, Ajay ajay.ga...@gmail.com wrote:

Please suggest what is the best option in this for production deployment
in EC2, given that we are deploying a Cassandra cluster for the 1st time
(so it is likely that we add more data centers/nodes and make schema
changes in the initial few months).

Voting for 2.0.13 is in process. I'd wait for that. But I don't need
OpsCenter.

=Rob
Re: Stable cassandra build for production usage
Yes we see https://issues.apache.org/jira/browse/CASSANDRA-8716 in our testing Thanks Ajay On Tue, Mar 17, 2015 at 3:20 PM, Marcus Eriksson krum...@gmail.com wrote: Do you see the segfault or do you see https://issues.apache.org/jira/browse/CASSANDRA-8716 ? On Tue, Mar 17, 2015 at 10:34 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Now that 2.0.13 is out, I don't see nodetool cleanup issue( https://issues.apache.org/jira/browse/CASSANDRA-8718) been fixed yet. The bug show priority Minor. Anybody facing this issue?. Thanks Ajay On Thu, Mar 12, 2015 at 11:41 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Mar 12, 2015 at 10:50 AM, Ajay ajay.ga...@gmail.com wrote: Please suggest what is the best option in this for production deployment in EC2 given that we are deploying Cassandra cluster for the 1st time (so likely that we add more data centers/nodes and schema changes in the initial few months) Voting for 2.0.13 is in process. I'd wait for that. But I don't need OpsCenter. =Rob
Re: Adding a Cassandra node using OpsCenter
Is there a separate forum for OpsCenter?

Thanks
Ajay

On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote:

Hi,

While adding a Cassandra node using OpsCenter (which is recommended), the
versions of Cassandra (DataStax Community edition) show only 2.0.9, and
not later versions in 2.0.x. Is there a reason behind it? Is 2.0.9
recommended over 2.0.11?

Thanks
Ajay
Re: Stable cassandra build for production usage
Hi, We did our research using 2.0.11 version. While preparing for the production deployment, found out the following issues: 1) 2.0.12 has nodetool cleanup issue - https://issues.apache.org/jira/browse/CASSANDRA-8718 2) 2.0.11 has nodetool issue - https://issues.apache.org/jira/browse/CASSANDRA-8548 3) OpsCenter 5.1.0 supports only - 2.0.9 and not later 2.0.x - https://issues.apache.org/jira/browse/CASSANDRA-8072 4) 2.0.9 has schema refresh issue - https://issues.apache.org/jira/browse/CASSANDRA-7734 Please suggest what is the best option in this for production deployment in EC2 given that we are deploying Cassandra cluster for the 1st time (so likely that we add more data centers/nodes and schema changes in the initial few months) Thanks Ajay On Thu, Jan 1, 2015 at 9:49 PM, Neha Trivedi nehajtriv...@gmail.com wrote: Use 2.0.11 for production On Wed, Dec 31, 2014 at 11:50 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Dec 31, 2014 at 8:38 AM, Ajay ajay.ga...@gmail.com wrote: For my research and learning I am using Cassandra 2.1.2. But I see couple of mail threads going on issues in 2.1.2. So what is the stable or popular build for production in Cassandra 2.x series. https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ =Rob
Re: Steps to do after schema changes
Thanks Mark. - Ajay

On 12-Mar-2015 11:08 pm, Mark Reddy mark.l.re...@gmail.com wrote: It's always good to run nodetool describecluster after a schema change; this will show you all the nodes in your cluster and what schema version they have. If they have different versions you have a schema disagreement and should follow this guide to resolution: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html Regards, Mark

On 12 March 2015 at 05:47, Phil Yang ud1...@gmail.com wrote: Usually, you have nothing to do. Changes will be synced to every node automatically.

2015-03-12 13:21 GMT+08:00 Ajay ajay.ga...@gmail.com: Hi, Are there any steps to do (like running nodetool or restarting the node) or any precautions after schema changes are done in a column family, say adding a new column or modifying any table properties? Thanks Ajay -- Thanks, Phil Yang
Re: Adding a Cassandra node using OpsCenter
Thanks Nick. Does it mean that only adding a new node with 2.0.10 or later is a problem? Can a node added manually be monitored from OpsCenter? Thanks Ajay

On 12-Mar-2015 10:19 pm, Nick Bailey n...@datastax.com wrote: There isn't an OpsCenter-specific mailing list, no. To answer your question, the reason OpsCenter provisioning doesn't support 2.0.10 and 2.0.11 is https://issues.apache.org/jira/browse/CASSANDRA-8072. That bug unfortunately prevents OpsCenter provisioning from working correctly, but isn't serious outside of provisioning. OpsCenter may be able to come up with a workaround, but at the moment those versions are unsupported. Sorry for the inconvenience. -Nick

On Thu, Mar 12, 2015 at 9:18 AM, Ajay ajay.ga...@gmail.com wrote: Is there a separate forum for OpsCenter? Thanks Ajay On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote: Hi, While adding a Cassandra node using OpsCenter (which is recommended), the versions of Cassandra (DataStax community edition) show only 2.0.9 and not later versions in 2.0.x. Is there a reason behind this? Is 2.0.9 recommended over 2.0.11? Thanks Ajay
Adding a Cassandra node using OpsCenter
Hi, While adding a Cassandra node using OpsCenter (which is recommended), the versions of Cassandra (DataStax community edition) show only 2.0.9 and not later versions in 2.0.x. Is there a reason behind this? Is 2.0.9 recommended over 2.0.11? Thanks Ajay
Steps to do after schema changes
Hi, Are there any steps to do (like running nodetool or restarting the node) or any precautions after schema changes are done in a column family, say adding a new column or modifying any table properties? Thanks Ajay
Re: Optimal Batch size (Unlogged) for Java driver
I have a column family with 15 columns: a timestamp, a timeuuid, a few text fields and the rest int fields. If I calculate the size of the column names and values and divide 5 KB (the recommended max size for a batch) by that value, I get 12. Is that correct? Am I missing something? Thanks Ajay

On 02-Mar-2015 12:13 pm, Ankush Goyal ank...@gmail.com wrote: Hi Ajay, I would suggest looking at the approximate size of individual elements in the batch, and based on that computing the max size (chunk size). It's not really a straightforward calculation, so I would further suggest making that chunk size a runtime parameter that you can tweak and play around with until you reach a stable state.

On Sunday, March 1, 2015 at 10:06:55 PM UTC-8, Ajay Garg wrote: Hi, I am looking at a way to compute the optimal batch size on the client side, similar to the below-mentioned bug on the server side (generic, as we are exposing REST APIs for Cassandra and the column family and data are different for each request). https://issues.apache.org/jira/browse/CASSANDRA-6487 How do we compute (approximately, using ColumnDefinitions or ColumnMetadata) the size of a row of a column family from the client side using the Cassandra Java driver? Thanks Ajay To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
Re: Optimal Batch size (Unlogged) for Java driver
Hi Ankush, We are already using PreparedStatement, and our case is time-series data as well. Thanks Ajay

On 02-Mar-2015 10:00 pm, Ankush Goyal ank...@gmail.com wrote: Ajay, First of all, I would recommend using PreparedStatements, so you would only be sending the variable bound arguments over the wire. Second, I think that the 5 KB limit for the WARN is too restrictive, and you could tune that on the Cassandra server side. I think if all you have is 15 columns (as long as their values are sanitized and do not go over certain limits), it should be fine to send all of them over at the same time. Chunking is necessary when you have time-series type data (for writes) OR you might be reading a lot of data via an IN query.

On Monday, March 2, 2015 at 7:55:18 AM UTC-8, Ajay Garg wrote: I have a column family with 15 columns: a timestamp, a timeuuid, a few text fields and the rest int fields. If I calculate the size of the column names and values and divide 5 KB (the recommended max size for a batch) by that value, I get 12. Is that correct? Am I missing something? Thanks Ajay On 02-Mar-2015 12:13 pm, Ankush Goyal ank...@gmail.com wrote: Hi Ajay, I would suggest looking at the approximate size of individual elements in the batch, and based on that computing the max size (chunk size). It's not really a straightforward calculation, so I would further suggest making that chunk size a runtime parameter that you can tweak and play around with until you reach a stable state. On Sunday, March 1, 2015 at 10:06:55 PM UTC-8, Ajay Garg wrote: Hi, I am looking at a way to compute the optimal batch size on the client side, similar to the below-mentioned bug on the server side (generic, as we are exposing REST APIs for Cassandra and the column family and data are different for each request).
https://issues.apache.org/jira/browse/CASSANDRA-6487 How do we compute (approximately, using ColumnDefinitions or ColumnMetadata) the size of a row of a column family from the client side using the Cassandra Java driver? Thanks Ajay
Optimal Batch size (Unlogged) for Java driver
Hi, I am looking at a way to compute the optimal batch size on the client side, similar to the below-mentioned bug on the server side (generic, as we are exposing REST APIs for Cassandra and the column family and data are different for each request). https://issues.apache.org/jira/browse/CASSANDRA-6487 How do we compute (approximately, using ColumnDefinitions or ColumnMetadata) the size of a row of a column family from the client side using the Cassandra Java driver? Thanks Ajay
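The arithmetic discussed in this thread (estimate a per-row payload size, then divide the batch threshold by it to get a chunk size) can be sketched as below. This is only illustrative: the column names and byte sizes are hypothetical stand-ins, not values read from any real schema or driver metadata.

```python
# Rough, illustrative estimate of how many rows fit under a batch-size
# threshold. Column names and value sizes below are hypothetical examples.

def estimate_row_bytes(columns):
    """columns: list of (name, approx_value_bytes) pairs for one row."""
    return sum(len(name) + size for name, size in columns)

def chunk_size(columns, threshold_bytes=5 * 1024):
    """How many rows of this shape fit under the (tunable) batch threshold."""
    row = estimate_row_bytes(columns)
    return max(1, threshold_bytes // row)

# 15 columns: a timestamp, a timeuuid, a few text fields, the rest ints
cols = ([("ts", 8), ("id", 16)]
        + [("text%d" % i, 50) for i in range(3)]
        + [("int%d" % i, 4) for i in range(10)])
```

As Ankush suggests, the threshold is best kept as a runtime parameter to tweak, since the real serialized size depends on the protocol encoding, not just name and value lengths.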
Re: Caching the PreparedStatement (Java driver)
Hi, My earlier question was whether it is safe to cache PreparedStatement (using the Java driver) on the client side, which Olivier confirmed. Now the question is: do we really need to cache the PreparedStatement on the client side? Let's take a scenario as below:
1) Client fires a REST query SELECT * from Test where Pk = val1;
2) REST service prepares a statement SELECT * from Test where Pk = ?
3) Executes the PreparedStatement by setting the values.
4) Assume we don't cache the PreparedStatement.
5) Client fires another REST query SELECT * from Test where Pk = val2;
6) REST service prepares a statement SELECT * from Test where Pk = ?
7) Executes the PreparedStatement by setting the values.
In this case, is there any benefit to using the PreparedStatement? From the Java driver code, Session.prepare(query) doesn't check whether a similar query was prepared earlier or not; it directly calls the server, passing the query. The return from the server is a PreparedId. Does the server maintain a cache of prepared queries, or does it still perform all the steps to prepare a query if the client prepares the same query more than once (using the same Session and Cluster instance, which I think doesn't matter)? Thanks Ajay

On Sat, Feb 28, 2015 at 9:17 AM, Ajay ajay.ga...@gmail.com wrote: Thanks Olivier. Most of the REST query calls would come from other applications to write/read to/from Cassandra, which means most queries from an application would be the same (same column families but different values). Thanks Ajay

On 28-Feb-2015 6:05 am, Olivier Michallat olivier.michal...@datastax.com wrote: Hi Ajay, Yes, it is safe to hold a reference to PreparedStatement instances in your client code. If you always run the same pre-defined statements, you can store them as fields in your resource classes.
If your statements are dynamically generated (for example, inserting different subsets of the columns depending on what was provided in the REST payload), your caching approach is valid. When you evict a PreparedStatement from your cache, the driver will also remove the corresponding id from its internal cache. If you re-prepare it later it might still be in the Cassandra-side cache, but that is not a problem. One caveat: you should be reasonably confident that your prepared statements will be reused. If your query strings are always different, preparing will bring no advantage. -- Olivier Michallat Driver tools engineer, DataStax

On Fri, Feb 27, 2015 at 7:04 PM, Ajay ajay.ga...@gmail.com wrote: Hi, We are building REST APIs for Cassandra using the Cassandra Java Driver. As per the below guidelines from the documentation, we are caching the Cluster instance (per cluster) and the Session instance (per keyspace), as they are multi-thread safe. http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/fourSimpleRules.html As the Cluster and Session instance(s) are already cached in the application, and as PreparedStatement provides better performance, we thought to build the PreparedStatement for a REST query implicitly (as REST calls are stateless) and cache the PreparedStatement. Whenever a REST query is invoked, we look for a PreparedStatement in the cache, and create and put it in the cache if it doesn't exist. (The cache is in-memory, fixed size, LRU based.) Is it a safe approach to cache PreparedStatement on the client side? Looking at the Java driver code, the Cluster class stores the PreparedStatements as weak references (to rebuild them when a node is down or a new node is added). Thanks Ajay
Caching the PreparedStatement (Java driver)
Hi, We are building REST APIs for Cassandra using the Cassandra Java Driver. As per the below guidelines from the documentation, we are caching the Cluster instance (per cluster) and the Session instance (per keyspace), as they are multi-thread safe. http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/fourSimpleRules.html As the Cluster and Session instance(s) are already cached in the application, and as PreparedStatement provides better performance, we thought to build the PreparedStatement for a REST query implicitly (as REST calls are stateless) and cache the PreparedStatement. Whenever a REST query is invoked, we look for a PreparedStatement in the cache, and create and put it in the cache if it doesn't exist. (The cache is in-memory, fixed size, LRU based.) Is it a safe approach to cache PreparedStatement on the client side? Looking at the Java driver code, the Cluster class stores the PreparedStatements as weak references (to rebuild them when a node is down or a new node is added). Thanks Ajay
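The fixed-size LRU cache described in this thread (prepare on first use, reuse afterwards, evict the least recently used entry when full) can be sketched driver-agnostically as below. `prepare_fn` is a hypothetical stand-in for the driver's `Session.prepare()`; it is any callable here, not a real driver API call.

```python
from collections import OrderedDict

class StatementCache:
    """Fixed-size LRU cache keyed by CQL text; a sketch of the caching
    scheme discussed in the thread. `prepare_fn` stands in for
    Session.prepare() and may be any callable."""

    def __init__(self, prepare_fn, max_size=128):
        self._prepare = prepare_fn
        self._max = max_size
        self._cache = OrderedDict()

    def get(self, cql):
        if cql in self._cache:
            self._cache.move_to_end(cql)      # mark as most recently used
            return self._cache[cql]
        stmt = self._prepare(cql)             # prepare only on first use
        self._cache[cql] = stmt
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)   # evict least recently used
        return stmt
```

As Olivier notes, evicting an entry and re-preparing later is harmless; the caveat is only that always-unique query strings make the cache (and preparing) pointless.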
Re: Pagination support on Java Driver Query API
The syntax suggested by Ondrej is not working in some cases in 2.0.11, and I logged an issue for the same: https://issues.apache.org/jira/browse/CASSANDRA-8797 Thanks Ajay

On Feb 12, 2015 11:01 PM, Bulat Shakirzyanov bulat.shakirzya...@datastax.com wrote: Fixed my Mail.app settings so you can see my actual name, sorry. On Feb 12, 2015, at 8:55 AM, DataStax bulat.shakirzya...@datastax.com wrote: Hello, As was mentioned earlier, the Java driver doesn't actually perform pagination. Instead, it uses the Cassandra native protocol to set the page size of the result set (https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v2.spec#L699-L730). When Cassandra sends the result back to the Java driver, it includes a binary token. This token represents the paging state. To fetch the next page, the driver re-executes the same statement with the original page size and the paging state attached. If there is another page available, Cassandra responds with a new paging state that can be used to fetch it. You could also try reporting this issue on the Cassandra user mailing list.

On Feb 12, 2015, at 8:35 AM, Eric Stevens migh...@gmail.com wrote: I don't know what the shape of the page state data is deep inside the Java driver; I've actually tried to dig into that in the past to understand it and see if I could reproduce it as a general-purpose any-query kind of thing. I gave up before I fully understood it, but I think it's actually a handle to an in-memory state maintained by the coordinator, which is only maintained for the lifetime of the statement (i.e. it's not stateless paging). That would make it a bad candidate for stateless paging scenarios such as REST requests, where a typical setup would load balance across HTTP hosts, never mind across coordinators.
It shouldn't be too much work to abstract this basic idea for manual paging into a general-purpose class that takes List[ClusteringKeyDef[T, O:Ordering]] and can produce a connection-agnostic PageState from a ResultSet or Row, or accepts a PageState to produce a WHERE CQL fragment. Also RE: possibly multiple queries to satisfy a page - yes, that's unfortunate. Since you're on 2.0.11, see Ondřej's answer to avoid it.

On Thu, Feb 12, 2015 at 8:13 AM, Ajay ajay.ga...@gmail.com wrote: Thanks Eric. I figured out the same but didn't get time to put it on the mail. But it is highly tied to how data is stored internally in Cassandra: basically how partition keys are used to distribute (less likely to change; we are not directly dependent on the partitioning algorithm) and clustering keys are used to sort the data within a partition (multi-level sorting, and hence the restrictions on the ORDER BY clause), which I think could change down the lane in Cassandra 3.x or 4.x in a different way for better storage or retrieval. That said, I am hesitant to implement this client-side logic for pagination because a) pages 2+ might need more than one query to Cassandra, b) the implementation is tied to Cassandra internal storage details which can change (though not often), and c) in our case we are building REST APIs which will be deployed on Tomcat clusters, hence whatever we cache to support pagination needs to be cached in a distributed way for failover support. Pagination support is best done at the server side, like ROWNUM in SQL, or better done in the Java driver to hide the internal details, where it can be optimized better as the server sends the paging state to the driver. Thanks Ajay

On Feb 12, 2015 8:22 PM, Eric Stevens migh...@gmail.com wrote: Your page state then needs to track the last ck1 and last ck2 you saw. Pages 2+ will end up needing up to two queries if the first query doesn't fill the page size.
CREATE TABLE foo (
  partitionkey int,
  ck1 int,
  ck2 int,
  col1 int,
  col2 int,
  PRIMARY KEY ((partitionkey), ck1, ck2)
) WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc);

INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,1,1,1);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,2,2,2);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,3,3,3);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,1,4,4);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,2,5,5);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,3,6,6);

If you're pulling the whole of partition 1 and your page size is 2, your first page looks like:

*PAGE 1*
SELECT * FROM foo WHERE partitionkey = 1 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   1 |   3 |    3 |    3
            1 |   1 |   2 |    2 |    2

You got enough rows to satisfy the page. Your page state is taken from the last row: (ck1=1, ck2=2).

*PAGE 2*
Notice that you have a page state, and add some limiting clauses on the statement:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 1 AND ck2 < 2 LIMIT 2;
Re: Pagination support on Java Driver Query API
Thanks Eric. I figured out the same but didn't get time to put it on the mail. But it is highly tied to how data is stored internally in Cassandra: basically how partition keys are used to distribute (less likely to change; we are not directly dependent on the partitioning algorithm) and clustering keys are used to sort the data within a partition (multi-level sorting, and hence the restrictions on the ORDER BY clause), which I think could change down the lane in Cassandra 3.x or 4.x in a different way for better storage or retrieval. That said, I am hesitant to implement this client-side logic for pagination because a) pages 2+ might need more than one query to Cassandra, b) the implementation is tied to Cassandra internal storage details which can change (though not often), and c) in our case we are building REST APIs which will be deployed on Tomcat clusters, hence whatever we cache to support pagination needs to be cached in a distributed way for failover support. Pagination support is best done at the server side, like ROWNUM in SQL, or better done in the Java driver to hide the internal details, where it can be optimized better as the server sends the paging state to the driver. Thanks Ajay

On Feb 12, 2015 8:22 PM, Eric Stevens migh...@gmail.com wrote: Your page state then needs to track the last ck1 and last ck2 you saw. Pages 2+ will end up needing up to two queries if the first query doesn't fill the page size.
CREATE TABLE foo (
  partitionkey int,
  ck1 int,
  ck2 int,
  col1 int,
  col2 int,
  PRIMARY KEY ((partitionkey), ck1, ck2)
) WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc);

INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,1,1,1);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,2,2,2);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,3,3,3);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,1,4,4);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,2,5,5);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,3,6,6);

If you're pulling the whole of partition 1 and your page size is 2, your first page looks like:

*PAGE 1*
SELECT * FROM foo WHERE partitionkey = 1 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   1 |   3 |    3 |    3
            1 |   1 |   2 |    2 |    2

You got enough rows to satisfy the page. Your page state is taken from the last row: (ck1=1, ck2=2).

*PAGE 2*
Notice that you have a page state, and add some limiting clauses on the statement:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 1 AND ck2 < 2 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   1 |   1 |    1 |    1

Oops, we didn't get enough rows to satisfy the page limit, so we need to continue on; we just need one more:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 > 1 LIMIT 1;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   2 |   3 |    6 |    6

We have enough to satisfy page 2 now; our new page state: (ck1=2, ck2=3).

*PAGE 3*
SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 2 AND ck2 < 3 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   2 |   2 |    5 |    5
            1 |   2 |   1 |    4 |    4

Great, we satisfied this page with only one query; page state: (ck1=2, ck2=1).
*PAGE 4*
SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 2 AND ck2 < 1 LIMIT 2;

(0 rows)

Oops, our initial query was on the boundary of ck1, but this looks like any other time that the initial query returns < pageSize rows; we just move on to the next query:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 > 2 LIMIT 2;

(0 rows)

Aha, we've exhausted ck1 as well, so there are no more pages; page 3 actually pulled the last possible value, page 4 is empty, and we're all done. Generally speaking, you know you're done when your first clustering key is the only non-equality operator in the statement and you got no rows back.

On Wed, Feb 11, 2015 at 10:55 AM, Ajay ajay.ga...@gmail.com wrote: Basically I am trying different queries with your approach. One such query is like: Select * from mycf where <condition on partition key> order by ck1 asc, ck2 desc, where ck1 and ck2 are clustering keys in that order. Here how do we achieve pagination support? Thanks Ajay

On Feb 11, 2015 11:16 PM, Ajay ajay.ga...@gmail.com wrote: Hi Eric, Thanks for your reply. I am using Cassandra 2.0.11, and in that I cannot append a condition like "last clustering key column value of the last row in the previous batch". It fails with "Preceding column is either not restricted or by a non-EQ relation". It means I need to specify an equal condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks
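Eric's two-query walkthrough above can be condensed into a small in-memory simulation. This is an illustrative sketch only: the "queries" are list filters over rows sorted by (ck1 ASC, ck2 DESC); a real client would issue the equivalent CQL statements.

```python
# Simulate Eric's manual paging for a table clustered by (ck1 ASC, ck2 DESC).
# Rows are (ck1, ck2) tuples; the filters mirror the CQL WHERE clauses.

def sort_key(row):
    ck1, ck2 = row
    return (ck1, -ck2)          # ck1 ascending, ck2 descending

def next_page(rows, state, limit):
    """state is (last_ck1, last_ck2) or None; returns (page, new_state)."""
    rows = sorted(rows, key=sort_key)
    if state is None:
        page = rows[:limit]     # PAGE 1: no state yet
    else:
        ck1, ck2 = state
        # Query 1: same ck1, strictly past the last ck2 seen (desc order).
        page = [r for r in rows if r[0] == ck1 and r[1] < ck2][:limit]
        if len(page) < limit:
            # Query 2: move on to later ck1 values to fill the page.
            page = page + [r for r in rows if r[0] > ck1][:limit - len(page)]
    new_state = page[-1] if page else None
    return page, new_state
```

Running it over the six sample rows reproduces the four pages of the walkthrough, including the empty final page that signals exhaustion.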
Re: Pagination support on Java Driver Query API
Hi Eric, Thanks for your reply. I am using Cassandra 2.0.11, and in that I cannot append a condition like "last clustering key column value of the last row in the previous batch". It fails with "Preceding column is either not restricted or by a non-EQ relation". It means I need to specify an equal condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks Ajay

"I can't believe that everyone reads and processes all rows at once (without pagination)." Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. But it sounds like you're talking about paginating a subset of data - larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state, typically keeping track of the last partitioning and clustering key seen and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote: Thanks Alex. But is there any workaround possible? I can't believe that everyone reads and processes all rows at once (without pagination).
Thanks Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote: On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote: 1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go previous. The FetchState is not exposed. Cassandra doesn't support skipping, so this is not really a limitation of the driver. -- [:-a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru
Re: Pagination support on Java Driver Query API
Basically I am trying different queries with your approach. One such query is like: Select * from mycf where <condition on partition key> order by ck1 asc, ck2 desc, where ck1 and ck2 are clustering keys in that order. Here how do we achieve pagination support? Thanks Ajay

On Feb 11, 2015 11:16 PM, Ajay ajay.ga...@gmail.com wrote: Hi Eric, Thanks for your reply. I am using Cassandra 2.0.11, and in that I cannot append a condition like "last clustering key column value of the last row in the previous batch". It fails with "Preceding column is either not restricted or by a non-EQ relation". It means I need to specify an equal condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks Ajay

"I can't believe that everyone reads and processes all rows at once (without pagination)." Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. But it sounds like you're talking about paginating a subset of data - larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state, typically keeping track of the last partitioning and clustering key seen and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value.
So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote: Thanks Alex. But is there any workaround possible? I can't believe that everyone reads and processes all rows at once (without pagination). Thanks Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote: On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote: 1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go previous. The FetchState is not exposed. Cassandra doesn't support skipping, so this is not really a limitation of the driver. -- [:-a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru
Re: Pagination support on Java Driver Query API
Thanks Alex. But is there any workaround possible? I can't believe that everyone reads and processes all rows at once (without pagination). Thanks Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote: On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote: 1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go previous. The FetchState is not exposed. Cassandra doesn't support skipping, so this is not really a limitation of the driver. -- [:-a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru
Pagination support on Java Driver Query API
Hi, I am working on exposing the Cassandra Query APIs (Java Driver) as REST APIs for our internal project. To support pagination, I looked at the Cassandra documentation, source code and other forums. What I mean by pagination support is like below:
1) Client fires a query to the REST server.
2) Server prepares the statement, caches the query and returns a query id (unique id).
3) Server takes the query id, offset and limit, returns the set of rows according to the offset and limit, and also returns the offset of the last returned row.
4) Client makes subsequent calls to the server with the offset returned by the server until all rows are returned. In case one call fails or times out, the client will make the call again.
Below are the details I found:
1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go previous. The FetchState is not exposed.
2) Using the token() function on the partition key of the last returned row, we can skip the returned rows, and using the LIMIT keyword we can limit the number of rows. But the problem I see is that the token() function cannot be used if the query contains an ORDER BY clause.
Is there any other way to achieve the pagination support? Thanks Ajay
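For the stateless-REST scenario raised in this thread, the paging state Eric describes (partition key offset plus last clustering value) has to travel to the client and back. A minimal sketch of such an opaque, host-independent token is below; the field names "pk" and "ck" are illustrative, and any serializable position data could be carried the same way.

```python
import base64
import json

def encode_state(last_pk_index, last_ck):
    """Serialize the client-side paging position (index into the
    deterministic partition-key list, plus the last clustering value)
    into an opaque token the REST caller can echo back."""
    raw = json.dumps({"pk": last_pk_index, "ck": last_ck}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_state(token):
    """Recover the paging position from the token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    d = json.loads(raw)
    return d["pk"], d["ck"]
```

Because the token is self-contained, any server behind the load balancer can resume the query, which is exactly what the coordinator-held paging state cannot offer.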
Re: Performance difference between Regular Statement Vs PreparedStatement
Thanks Eric. I didn't know the point about token-aware routing. But even with points 2 and 3, I didn't notice much improvement with prepared statements. I have 2 Cassandra nodes running in virtual boxes on the same machine, and the test client running on the same machine as well. Thanks Ajay

Prepared statements can take advantage of token-aware routing, which IIRC non-prepared statements cannot in the DataStax Java Driver, so as your cluster grows you reduce the overhead of statement coordination (assuming you use token-aware routing). There should also be less data to transfer for shipping the query (the CQL portion is shipped once during the prepare stage, and only the data is shipped on subsequent executions). You'll also save the cluster the overhead of repeatedly parsing your CQL statements.

On Wed, Jan 28, 2015 at 11:50 PM, Ajay ajay.ga...@gmail.com wrote: Hi All, I tried both insert and select queries (using QueryBuilder) as regular statements and as PreparedStatements in multithreaded code, running each query say 10k to 50k times. But I don't see any visible improvement using the PreparedStatement. What could be the reason? Note: I am using the same Session object in multiple threads. Cassandra version: 2.0.11. Driver version: 2.1.4. Thanks Ajay
Re: User audit in Cassandra
Thanks Tyler Hobbs. We need to capture which queries a user ran in a session and the time they took (we don't need the query plan or such). Is that possible? With an Authenticator, we can capture only the session creation, right? Thanks Ajay

On Sat, Jan 10, 2015 at 6:07 AM, Tyler Hobbs ty...@datastax.com wrote: system_traces is for query tracing, which is for diagnosing performance problems, not logging activity. Cassandra is designed to allow you to write your own Authenticator pretty easily. You can just subclass PasswordAuthenticator and add logging where desired. Compile that into a jar, put it in the lib/ directory for Cassandra, and change cassandra.yaml to use that class.

On Thu, Jan 8, 2015 at 6:34 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Is there a way to enable user audit or trace if we have enabled PasswordAuthenticator in cassandra.yaml and set up the users as well? I noticed there are keyspaces system_auth and system_traces. But there is no way to find out which user initiated which session. Is there any way to find out? Also, is it recommended to enable tracing in production to know how many sessions were started by a user? Thanks Ajay -- Tyler Hobbs DataStax http://datastax.com/
Re: Cassandra primary key design to cater range query
Hi, I read somewhere that the order of columns in the clustering key matters. Please correct me if I am wrong. For example, with PRIMARY KEY((prodgroup), status, productid), the below query cannot run:

select * from product where prodgroup='xyz' and productid > 0

But this query can be run:

select * from product where prodgroup='xyz' and status = 0 and productid > 0

It means all the preceding part of the clustering key has to be provided in the query. So with that, if you want to query "Get details of a specific product (either active or inactive)", you might need to reorder the columns, like PRIMARY KEY((prodgroup), productid, status). Thanks Ajay

On Sat, Jan 10, 2015 at 6:03 AM, Tyler Hobbs ty...@datastax.com wrote: Your proposed model for the table to handle the last query looks good, so I would stick with that.

On Mon, Jan 5, 2015 at 5:45 AM, Nagesh nageswara.r...@gmail.com wrote: Hi All, I have designed a column family: prodgroup text, prodid int, status int, ..., PRIMARY KEY ((prodgroup), prodid, status). The data model is to cater for: get the list of products from the product group; get the list of products for a given range of ids; get details of a specific product; update the status of the product active/inactive; get the list of products that are active or inactive (select * from product where prodgroup='xyz' and prodid > 0 and status = 0). The design works fine, except for the last query: Cassandra is not allowing a query on status unless I fix the product id. I think defining a column family which has the key PRIMARY KEY((prodgroup), status, productid) should work. Would like to get expert advice on other alternatives. -- Thanks, Nageswara Rao.V *The LORD reigns*
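The restriction discussed here (a clustering column can carry a range only if every clustering column before it is restricted by equality) can be captured in a small validity check. This is an illustrative sketch of the rule as described in the thread, not the driver's or server's actual validation code; `'EQ'`/`'RANGE'` are made-up labels.

```python
def can_run(clustering_cols, restrictions):
    """clustering_cols: clustering columns in declared order.
    restrictions: dict mapping column -> 'EQ' or 'RANGE' (absent = none).
    Mirrors the rule behind Cassandra's "Preceding column is either not
    restricted or by a non-EQ relation" error, as described in the thread."""
    prefix_eq = True                 # so far, every earlier column is EQ
    for col in clustering_cols:
        r = restrictions.get(col)
        if r is not None and not prefix_eq:
            return False             # restricted after a non-EQ/missing column
        if r != 'EQ':
            prefix_eq = False        # range or unrestricted ends the EQ prefix
    return True
```

Applied to the two key orders in the thread, it shows why reordering to PRIMARY KEY((prodgroup), productid, status) permits a bare range on productid.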
User audit in Cassandra
Hi, Is there a way to enable user audit or tracing if we have enabled PasswordAuthenticator in cassandra.yaml and set up the users as well? I noticed there are keyspaces system_auth and system_traces, but there is no way to find out which user initiated which session. Is there any way to find out? Also, is it recommended to enable tracing in production just to know how many sessions were started by a user? Thanks Ajay
Token function in CQL for composite partition key
Hi, I have a column family as below (wide-row design):

CREATE TABLE clicks (
  hour text,
  adId int,
  itemId int,
  time timeuuid,
  PRIMARY KEY ((adId, hour), time, itemId)
) WITH CLUSTERING ORDER BY (time DESC);

Now, to query for a given Ad Id and a specific 3 hours, say 2015-01-07 11 to 2015-01-07 14, how do I use the token function in CQL? Thanks Ajay
Re: Token function in CQL for composite partition key
Thanks. Basically there are two access patterns:

1) For the last 1 hour (or more, if the last batch failed for some reason), get the clicks data for all Ads. But it seems this is not possible, as Ad Id is part of the partition key.
2) For the last 1 hour (or more, if the last batch failed for some reason), get the clicks data for a specific Ad Id (or a few of them).

How do we support 1 and 2 with the same data model? (I used Ad Id + hour as the partition key to avoid hotspots.) Thanks Ajay

On Wed, Jan 7, 2015 at 6:34 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Wed, Jan 7, 2015 at 10:18 AM, Ajay ajay.ga...@gmail.com wrote: Hi, I have a column family as below (wide-row design): CREATE TABLE clicks (hour text, adId int, itemId int, time timeuuid, PRIMARY KEY((adId, hour), time, itemId)) WITH CLUSTERING ORDER BY (time DESC); Now to query for a given Ad Id and a specific 3 hours, say 2015-01-07 11 to 2015-01-07 14, how do I use the token function in CQL? From that description, it doesn't appear to me that you need the token function. Just do 3 queries, one per hour, each query being something along the lines of SELECT * FROM clicks WHERE adId=... AND hour='2015-01-07 11' AND ... For completeness' sake, I should note that you could do that with a single query by using an IN on the hour column, but it's actually not a better solution (provided you submit the 3 queries in an asynchronous fashion, at least), for the reasons explained here: https://medium.com/@foundev/cassandra-query-patterns-not-using-the-in-query-e8d23f9b17c7 . -- Sylvain
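Sylvain's suggestion boils down to enumerating the hour buckets in the range and issuing one query per bucket, asynchronously. A sketch, assuming the 'yyyy-MM-dd HH' bucket format implied by the schema and a placeholder adId of 42:

```python
from datetime import datetime, timedelta

def hour_buckets(start, end, fmt="%Y-%m-%d %H"):
    """Yield one partition-key hour string per hour in [start, end)."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        yield t.strftime(fmt)
        t += timedelta(hours=1)

queries = [
    f"SELECT * FROM clicks WHERE adId = 42 AND hour = '{h}'"
    for h in hour_buckets(datetime(2015, 1, 7, 11), datetime(2015, 1, 7, 14))
]
for q in queries:   # submit each with the driver's async execute, then gather
    print(q)
```

Each query hits exactly one partition, which is why the fan-out beats a single IN query on the hour column.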
Re: Keyspace uppercase name issues
We noticed the same issue. From cassandra-cli it allows upper-case or mixed-case keyspace names, but cqlsh auto-converts them to lower case. Thanks Ajay

On Wed, Jan 7, 2015 at 9:44 PM, Harel Gliksman harelg...@gmail.com wrote: Hi, We have a Cassandra cluster with keyspaces that were created using the Thrift API, and their names contain upper-case letters. We are trying to use the new DataStax driver (version 2.1.4, Maven's latest), but are encountering some problems due to upper-case handling. DataStax provides this guidance on how to handle lower/upper case: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/ucase-lcase_r.html However, there seems to be something confusing in the API. Attached is a small Java program that reproduces the problem. Many thanks, Harel.
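The behaviour both posters are seeing follows from CQL's identifier rules: unquoted identifiers are folded to lower case, while double-quoted identifiers keep their exact case. A toy model of that folding rule (not the actual parser):

```python
def fold_identifier(name):
    """Mimic CQL identifier handling: double-quoted names keep their
    case, unquoted names are lower-cased."""
    if name.startswith('"') and name.endswith('"') and len(name) >= 2:
        return name[1:-1]
    return name.lower()

print(fold_identifier("MyKeyspace"))      # mykeyspace (what cqlsh resolves)
print(fold_identifier('"MyKeyspace"'))    # MyKeyspace (case preserved)
```

So a keyspace created via Thrift as MyKeyspace has to be referenced from CQL as "MyKeyspace", with the double quotes.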
Re: Cassandra nodes in VirtualBox
Neha, This is just a trial setup. Anyway, thanks for the suggestion (more than 1 seed node). I figured out the problem: Node2 had an incorrect cluster name. The error message seems misleading, though. Thanks Ajay Garg

On Mon, Jan 5, 2015 at 4:21 PM, Neha Trivedi nehajtriv...@gmail.com wrote: Hi Ajay, 1. You should have at least 2 seed nodes; it will help when Node1 (the only seed node) is down. 2. Check that you are using the internal IP address in listen_address and rpc_address.

On Mon, Jan 5, 2015 at 2:07 PM, Ajay ajay.ga...@gmail.com wrote: Hi, I did the Cassandra cluster setup as below: Node 1 (seed node), Node 2, Node 3, Node 4. All 4 nodes are VirtualBox VMs with Ubuntu 14.10. I have set listen_address and rpc_address to the inet address, with SimpleSnitch. When I start Node2 after Node1 is started, I get java.lang.RuntimeException: Unable to gossip with any seeds. What could be the reason? Thanks Ajay
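For reference, the settings involved in this failure live in cassandra.yaml: cluster_name must be the identical string on every node in the ring, and the seed list should contain reachable internal addresses. The values below are placeholders for a host-only VirtualBox network, not a recommendation:

```yaml
# cassandra.yaml (excerpt) -- must be consistent across the ring
cluster_name: 'Test Cluster'        # identical string on all nodes
listen_address: 192.168.56.101      # this node's internal (host-only) IP
rpc_address: 192.168.56.101
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.56.101,192.168.56.102"   # two seeds, per the advice above
```

A cluster_name mismatch makes a joining node refuse the ring, which can surface as the gossip error seen above.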
Cassandra nodes in VirtualBox
Hi, I did the Cassandra cluster setup as below: Node 1 (seed node), Node 2, Node 3, Node 4. All 4 nodes are VirtualBox VMs with Ubuntu 14.10. I have set listen_address and rpc_address to the inet address, with SimpleSnitch. When I start Node2 after Node1 is started, I get java.lang.RuntimeException: Unable to gossip with any seeds. What could be the reason? Thanks Ajay
Re: User click count
Thanks Eric. Happy new year 2015 to all Cassandra developers and users :). This group seems the most active of the Apache big data projects. Will come back with more questions :) Thanks Ajay

On Dec 31, 2014 8:02 PM, Eric Stevens migh...@gmail.com wrote: You can totally avoid the impact of tombstones by rotating your partition key in the exact-counts table, and only deleting whole partitions once you've counted them. Once you've counted them, you never have cause to read that partition key again. You can totally store the final counts in Cassandra as a standard (non-counter) column, and you can even use counters to keep track of the time slices which haven't been formally counted yet, so that you can get reasonably accurate information about time slices that haven't been trued up yet. This is basically what's called a Lambda architecture: use efficient real-time processing to get pretty close to accurate values when real-time performance matters, then use a cleanup process to get perfectly accurate values when you can afford non-real-time processing, and store that final computation so that you can continue to access it quickly. "is there any technical reason behind it (just out of curiosity)?" Distributed counting is a fundamentally hard problem if you wish to do it in a manner that avoids bottlenecks (i.e. remains distributed) and also provides perfect accuracy. There's plenty of research in this area, and there isn't a single algorithm that provides all the properties we would hope for. Instead there are different algorithms that make different tradeoffs. The way that Cassandra's counters can fail is that most operations in Cassandra are idempotent: if we're not sure whether an update has been applied correctly or not, we can simply apply it again, because it's safe to do twice. Counters are not idempotent.
If you try to increment a counter and you're not certain whether the increment was successful, it is *not* safe to try again (if it was successful the previous time, you've now incremented twice when it should have been once). Most of the time counters are reasonable and accurate, but in failure scenarios you may get some changes applied more than once, or not at all. With that in mind, you might find that being perfectly accurate most of the time, and within a fraction of a percent the other times, is acceptable. If so, counters are your friend; if not, a more complex Lambda-style approach as we've been advocating here is best.

On Tue, Dec 30, 2014 at 10:54 PM, Ajay ajay.ga...@gmail.com wrote: Thanks Janne and Rob. The idea is like this: store the user clicks in Cassandra, and have a scheduler count/aggregate the clicks per link or ad hourly/daily/monthly and store the result in MySQL (or maybe in Cassandra itself). Since tombstones are deleted only after some days (as per configuration), could the subsequent queries that count the rows be affected (I mean, say, thousands of tombstones affecting the performance of the query)? Secondly, as I understand from this mail thread, the counter is not correct for this use case; is there any technical reason behind it (just out of curiosity)? Thanks Ajay

On Tue, Dec 30, 2014 at 10:37 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! Yes, since all the writes for a partition (or row, if you speak Thrift) always go to the same replicas, you will need to design to avoid hotspots: a pure day row will cause all the writes for a single day to go to the same replicas, so those nodes will have to work really hard for a day, and then the next day it's again hard work for some other nodes. If you have a user id there in front, then it will distribute better.
For tombstone purposes, think of your access patterns: if you have a date-based system, it probably does not matter, since you will scan those UUIDs once and then they will be tombstoned away. It's cleaner if you can delete the entire row with a single command, but as long as you never read it again, I don't think this matters much. The real problems with wide rows come with compaction, and you shouldn't have many problems with compaction because this is an append-only row, so it should be fine as a fairly wide row. Make some back-of-the-envelope calculations, and if it looks like you're going to be hitting tens of millions of columns per day, then store per hour. One important thing: in order not to lose clicks, always use timeuuids instead of timestamps (or else two clicks coming in for the same id would overwrite each other and count as one). /Janne

On 30 Dec 2014, at 06:28, Ajay ajay.ga...@gmail.com wrote: Thanks Janne, Alain and Eric. Now say I go with counters (hourly, daily, monthly) and also store UUIDs as below: user id : yyyy/mm/dd as row key and dynamic columns for each click, with the column key as a timestamp and the value empty. Periodically count the columns and rows
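Eric's idempotency point can be made concrete: repeating an overwrite converges to the same state, while repeating a counter increment does not. A small self-contained illustration, with plain Python standing in for the two kinds of writes:

```python
def retry(op, state, times=2):
    """Apply the same operation more than once, as a client retrying
    after an ambiguous (timed-out) write might."""
    for _ in range(times):
        state = op(state)
    return state

# Idempotent write: setting a cell to a value is safe to repeat.
print(retry(lambda s: 7, 0))        # 7 -- same result however many retries

# Non-idempotent counter increment: each retry adds again.
print(retry(lambda s: s + 1, 0))    # 2 -- one logical click counted twice
```

This is why a timed-out counter update leaves you stuck: replaying it risks double-counting, and not replaying it risks under-counting.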
Stable cassandra build for production usage
Hi All, For my research and learning I am using Cassandra 2.1.2, but I see a couple of mail threads about issues in 2.1.2. So what is the stable or popular build for production in the Cassandra 2.x series? Thanks Ajay
Re: User click count
Thanks Janne and Rob. The idea is like this: store the user clicks in Cassandra, and have a scheduler count/aggregate the clicks per link or ad hourly/daily/monthly and store the result in MySQL (or maybe in Cassandra itself). Since tombstones are deleted only after some days (as per configuration), could the subsequent queries that count the rows be affected (I mean, say, thousands of tombstones affecting the performance of the query)? Secondly, as I understand from this mail thread, the counter is not correct for this use case; is there any technical reason behind it (just out of curiosity)? Thanks Ajay

On Tue, Dec 30, 2014 at 10:37 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! Yes, since all the writes for a partition (or row, if you speak Thrift) always go to the same replicas, you will need to design to avoid hotspots: a pure day row will cause all the writes for a single day to go to the same replicas, so those nodes will have to work really hard for a day, and then the next day it's again hard work for some other nodes. If you have a user id there in front, then it will distribute better. For tombstone purposes, think of your access patterns: if you have a date-based system, it probably does not matter, since you will scan those UUIDs once and then they will be tombstoned away. It's cleaner if you can delete the entire row with a single command, but as long as you never read it again, I don't think this matters much. The real problems with wide rows come with compaction, and you shouldn't have many problems with compaction because this is an append-only row, so it should be fine as a fairly wide row. Make some back-of-the-envelope calculations, and if it looks like you're going to be hitting tens of millions of columns per day, then store per hour. One important thing: in order not to lose clicks, always use timeuuids instead of timestamps (or else two clicks coming in for the same id would overwrite each other and count as one).
/Janne

On 30 Dec 2014, at 06:28, Ajay ajay.ga...@gmail.com wrote: Thanks Janne, Alain and Eric. Now say I go with counters (hourly, daily, monthly) and also store UUIDs as below: user id : yyyy/mm/dd as row key and dynamic columns for each click, with the column key as a timestamp and the value empty. Periodically count the columns and rows and correct the counters. Now in this case, there will be one row per day but as many columns as user clicks. The other way is to store a row per hour: user id : yyyy/mm/dd/hh as row key and dynamic columns for each click, with the column key as a timestamp and the value empty. Is there any difference (in performance, or any known issues) between more rows vs. more columns, given that Cassandra deletes them through tombstones (say by default after 20 days)? Thanks Ajay

On Mon, Dec 29, 2014 at 7:47 PM, Eric Stevens migh...@gmail.com wrote: Regarding "If the counters get incorrect, it couldn't be corrected": you'd have to store something that allows you to correct it. For example, the timeuuid approach keeps true counts, which are slow to read but accurate, plus a background process that trues up your counter columns periodically.

On Mon, Dec 29, 2014 at 7:05 AM, Ajay ajay.ga...@gmail.com wrote: Thanks for the clarification. In my case, Cassandra is the only storage. If the counters become incorrect, they can't be corrected. Given that, if we store the raw data, we could just as well go with that approach. But the granularity has to be at the seconds level, as more than one user can click the same link. So the data will be huge, with more writes and more rows to count for reads, right? Thanks Ajay

On Mon, Dec 29, 2014 at 7:10 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Ajay, Here is a good explanation you might want to read: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters We have used counters for 3 years now; we used them from the start (C* 0.8) and we are happy with them.
The limits I can see in both approaches are:

Counters:
- accuracy, indeed (the drift tends to be small in our use case, ~5%, and the business allows 10%, so fair enough for us); we also recount them through a batch-processing tool (Spark/Hadoop, a kind of Lambda architecture), so our real-time stats are inaccurate but after a few minutes or hours we have the real value.
- a read-before-write model, which is an anti-pattern; it makes you use more machines due to the pressure involved, but that is affordable for us too.

Raw data (counted):
- space used (can become quite impressive very fast, depending on your business)!
- time to answer a request (we expose the data to customers; they don't want to wait 10 sec for Cassandra to read 1,000,000+ columns).
- performance in O(n) (linear) instead of O(1) (constant). Customers won't always understand why it is harder for you to read 1,000,000 than 1, since it should be reading one number in both cases, and your interface will have a very unstable read time.

Pick the best solution
User click count
Hi, Is it better to use a counter for user click counts than to create a new row per click (user id : timestamp) and count the rows? Basically we want to track user clicks and use the same for hourly/daily/monthly reports. Thanks Ajay
Re: User click count
Hi, So you mean to say counters are not accurate? (It is highly likely that multiple parallel threads will try to increment the counter as users click the links.) Thanks Ajay

On Mon, Dec 29, 2014 at 4:49 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! It's really a tradeoff between accurate and fast, and your read access patterns: if you need it to be fairly fast, use counters by all means, but accept the fact that they will (especially in older versions of Cassandra, or under adverse network conditions) drift off from the true click count. If you need accurate, use a timeuuid and count the rows (this is fairly safe for replays too). However, with timeuuids your storage will need lots of space, and your reads will be slow if the click counts are huge (because Cassandra will need to read every item). Using counters makes it easy to just grab a slice of the time-series data and shove it to a client for visualization. You could of course do a hybrid system: use timeuuids, then periodically count them, add the result to a regular column, and remove the counted columns. Note that you might want to optimize this so that you don't end up with a lot of tombstones, e.g. by bucketing the writes so that you can delete everything with a single partition delete. At Thinglink, some of the more important counters that we use are backed by the actual data. So for speed purposes we always use counters for reads, but there's a repair process that fixes the counter value if we suspect it has started drifting too far from the real data. (You might be able to tell that we've been using counters for quite some time :-P) /Janne

On 29 Dec 2014, at 13:00, Ajay ajay.ga...@gmail.com wrote: Hi, Is it better to use a counter for user click counts than to create a new row per click (user id : timestamp) and count the rows? Basically we want to track user clicks and use the same for hourly/daily/monthly reports. Thanks Ajay
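The hybrid scheme Janne describes (fast counters for reads, exact event rows to true them up) can be sketched as below, with a Python list standing in for the timeuuid rows and an integer for the counter column:

```python
events = []     # one timeuuid row per click: exact, but slow to count
counter = 0     # counter column: fast to read, but may drift

def record_click():
    """Normal path: write an exact event row and bump the fast counter."""
    global counter
    events.append("evt")
    counter += 1

def true_up():
    """Repair job: overwrite the counter with the exact event count."""
    global counter
    counter = len(events)

for _ in range(5):
    record_click()
counter += 1        # drift: one increment replayed after an ambiguous timeout

print(counter)      # 6 -- drifted value served for fast reads
true_up()
print(counter)      # 5 -- exact again after the repair pass
```

Reads stay cheap (one counter cell) while accuracy is restored periodically from the event rows, which is the essence of the Lambda-style approach discussed in this thread.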
Re: User click count
Thanks for the clarification. In my case, Cassandra is the only storage. If the counters become incorrect, they can't be corrected. Given that, if we store the raw data, we could just as well go with that approach. But the granularity has to be at the seconds level, as more than one user can click the same link. So the data will be huge, with more writes and more rows to count for reads, right? Thanks Ajay

On Mon, Dec 29, 2014 at 7:10 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Ajay, Here is a good explanation you might want to read: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters We have used counters for 3 years now; we used them from the start (C* 0.8) and we are happy with them. The limits I can see in both approaches are:

Counters:
- accuracy, indeed (the drift tends to be small in our use case, ~5%, and the business allows 10%, so fair enough for us); we also recount them through a batch-processing tool (Spark/Hadoop, a kind of Lambda architecture), so our real-time stats are inaccurate but after a few minutes or hours we have the real value.
- a read-before-write model, which is an anti-pattern; it makes you use more machines due to the pressure involved, but that is affordable for us too.

Raw data (counted):
- space used (can become quite impressive very fast, depending on your business)!
- time to answer a request (we expose the data to customers; they don't want to wait 10 sec for Cassandra to read 1,000,000+ columns).
- performance in O(n) (linear) instead of O(1) (constant). Customers won't always understand why it is harder for you to read 1,000,000 than 1, since it should be reading one number in both cases, and your interface will have a very unstable read time.

Pick the best solution (or combination) for your use case. These disadvantage lists are not exhaustive, just things that came to my mind right now. C*heers, Alain

2014-12-29 13:33 GMT+01:00 Ajay ajay.ga...@gmail.com: Hi, So you mean to say counters are not accurate?
(It is highly likely that multiple parallel threads will try to increment the counter as users click the links.) Thanks Ajay

On Mon, Dec 29, 2014 at 4:49 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: Hi! It's really a tradeoff between accurate and fast, and your read access patterns: if you need it to be fairly fast, use counters by all means, but accept the fact that they will (especially in older versions of Cassandra, or under adverse network conditions) drift off from the true click count. If you need accurate, use a timeuuid and count the rows (this is fairly safe for replays too). However, with timeuuids your storage will need lots of space, and your reads will be slow if the click counts are huge (because Cassandra will need to read every item). Using counters makes it easy to just grab a slice of the time-series data and shove it to a client for visualization. You could of course do a hybrid system: use timeuuids, then periodically count them, add the result to a regular column, and remove the counted columns. Note that you might want to optimize this so that you don't end up with a lot of tombstones, e.g. by bucketing the writes so that you can delete everything with a single partition delete. At Thinglink, some of the more important counters that we use are backed by the actual data. So for speed purposes we always use counters for reads, but there's a repair process that fixes the counter value if we suspect it has started drifting too far from the real data. (You might be able to tell that we've been using counters for quite some time :-P) /Janne

On 29 Dec 2014, at 13:00, Ajay ajay.ga...@gmail.com wrote: Hi, Is it better to use a counter for user click counts than to create a new row per click (user id : timestamp) and count the rows? Basically we want to track user clicks and use the same for hourly/daily/monthly reports. Thanks Ajay
Re: Counter Column
Thanks. I went through some articles which mentioned that the client passes the timestamp for inserts and updates. Is there any way to avoid that and have Cassandra assume the current time of the server? Thanks Ajay

On Dec 26, 2014 10:50 PM, Eric Stevens migh...@gmail.com wrote: Timestamps are timezone-independent. This is a property of timestamps, not a property of Cassandra. A given moment is the same timestamp everywhere in the world. To display it in a human-readable form, you then need to know which timezone you're attempting to represent the timestamp in; that is the information necessary to convert it to local time.

On Fri, Dec 26, 2014 at 2:05 AM, Ajay ajay.ga...@gmail.com wrote: Hi, If the nodes of a Cassandra ring are in different timezones, could that affect the counter column, as it depends on the timestamp? Thanks Ajay
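Eric's point is easy to check with the standard library: one instant rendered in two zones still has the same epoch timestamp, which is why nodes in different timezones agree on write timestamps:

```python
from datetime import datetime, timedelta, timezone

# One instant, expressed in UTC and in UTC+05:30 local time.
utc = datetime(2014, 12, 26, 10, 0, tzinfo=timezone.utc)
ist = utc.astimezone(timezone(timedelta(hours=5, minutes=30)))

print(ist)                                   # 2014-12-26 15:30:00+05:30
print(utc.timestamp() == ist.timestamp())    # True -- same epoch timestamp
```

The local rendering differs, but the underlying value (what Cassandra stores and compares) is identical.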
Counter Column
Hi, If the nodes of a Cassandra ring are in different timezones, could that affect the counter column, as it depends on the timestamp? Thanks Ajay
Throughput Vs Latency
Hi, I am new to NoSQL (and Cassandra). As I go through articles on Cassandra, they say Cassandra achieves the highest throughput among various NoSQL solutions, but at the cost of high read and write latency. I have a basic question here: (if my understanding is right) latency is the time taken to accept input, process it, and respond. If latency is high, how can throughput also be high? Thanks Ajay
Re: Throughput Vs Latency
Thanks Thomas for the clarification. If I use a consistency level of QUORUM for reads and writes, the latency would affect the throughput, right? Thanks Ajay

On Fri, Dec 26, 2014 at 11:15 AM, Job Thomas j...@suntecgroup.com wrote: Hi, First of all, the write latency of Cassandra is not high (read latency is). The high throughput is achieved through distributed reads and writes. Your doubt ("if latency is high, how can throughput be high?") is somewhat right if you use high consistency for both reads and writes. You still get the benefit of distribution, since it is not a master/slave architecture (like HBase). If your consistency level is lower, then some nodes out of all the replica nodes are free and can serve other reads/writes. [Assuming you are using a multithreaded application.] Thanks & Regards, Job M Thomas, Platform Technology, Mob: 7560885748

--
*From:* Ajay [mailto:ajay.ga...@gmail.com] *Sent:* Fri 12/26/2014 10:46 AM *To:* user@cassandra.apache.org *Subject:* Throughput Vs Latency

Hi, I am new to NoSQL (and Cassandra). As I go through articles on Cassandra, they say Cassandra achieves the highest throughput among various NoSQL solutions, but at the cost of high read and write latency. I have a basic question here: (if my understanding is right) latency is the time taken to accept input, process it, and respond. If latency is high, how can throughput also be high? Thanks Ajay
Re: Throughput Vs Latency
Hi Thomas, I am a little confused when you say multithreaded client. We don't explicitly invoke reads on multiple servers (for replicated data) from the client code, so how does a multithreaded client fix this? Thanks Ajay

On Fri, Dec 26, 2014 at 12:08 PM, Job Thomas j...@suntecgroup.com wrote: Hi Ajay, My understanding is this: if you have a cluster of 3 nodes with a replication factor of 3, then latency plays a bigger role in throughput. If the cluster size is 6 with a replication factor of 3, and you are using a multithreaded client, then the latency remains the same and you get better throughput (not because of the 6 nodes alone, but because of 6 nodes plus multiple threads). Thanks & Regards, Job M Thomas, Platform Technology, Mob: 7560885748

From: Ajay [mailto:ajay.ga...@gmail.com] Sent: Fri 12/26/2014 11:57 AM To: user@cassandra.apache.org Subject: Re: Throughput Vs Latency

Thanks Thomas for the clarification. If I use a consistency level of QUORUM for reads and writes, the latency would affect the throughput, right? Thanks Ajay

On Fri, Dec 26, 2014 at 11:15 AM, Job Thomas j...@suntecgroup.com wrote: Hi, First of all, the write latency of Cassandra is not high (read latency is). The high throughput is achieved through distributed reads and writes. Your doubt ("if latency is high, how can throughput be high?") is somewhat right if you use high consistency for both reads and writes. You still get the benefit of distribution, since it is not a master/slave architecture (like HBase). If your consistency level is lower, then some nodes out of all the replica nodes are free and can serve other reads/writes. [Assuming you are using a multithreaded application.] Thanks & Regards, Job M Thomas, Platform Technology, Mob: 7560885748

From: Ajay [mailto:ajay.ga...@gmail.com] Sent: Fri 12/26/2014 10:46 AM To: user@cassandra.apache.org Subject: Throughput Vs Latency

Hi, I am new to NoSQL (and Cassandra).
As I go through articles on Cassandra, they say Cassandra achieves the highest throughput among various NoSQL solutions, but at the cost of high read and write latency. I have a basic question here: (if my understanding is right) latency is the time taken to accept input, process it, and respond. If latency is high, how can throughput also be high? Thanks Ajay
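The apparent paradox in this thread resolves via Little's Law: throughput is roughly the number of requests in flight divided by per-request latency, so high latency and high throughput coexist when many requests run concurrently across nodes and client threads. A back-of-the-envelope sketch (the numbers are illustrative only):

```python
def throughput(concurrency, latency_s):
    """Little's Law: requests completed per second equals the number of
    requests in flight divided by the average per-request latency."""
    return concurrency / latency_s

# Same per-request latency, very different aggregate throughput:
print(throughput(1, 0.25))     # 4.0    -- one request at a time, 250 ms each
print(throughput(512, 0.25))   # 2048.0 -- 512 in flight across nodes/threads
```

This is also why the multithreaded-client remark matters above: a single-threaded client caps concurrency at 1, so its throughput is bounded by latency no matter how many nodes the cluster has.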
Re: Cassandra for Analytics?
Thanks Ryan and Peter for the suggestions. Our requirement (an e-commerce company), at a high level, is to build a data warehouse as a platform or service (for different product teams to consume), as below:

Data warehouse as a platform/service
        |
Spark SQL
        |
Spark in-memory computation engine (we were considering Drill/Flink, but Spark is more mature and in production)
        |
Cassandra/HBase (yet to be decided; aggregated views + data written directly to this, so 40-50% writes, 50-60% reads)
        |
Stream processing (Spark Streaming or Storm, yet to be decided; Spark Streaming is relatively new)
        |
MySQL/Mongo/real-time data

Since we are planning to build it as a service, we cannot assume a particular data-access pattern. Thanks Ajay

On Thu, Dec 18, 2014 at 7:00 PM, Peter Lin wool...@gmail.com wrote: For the record, I think Spark is good and I'm glad we have options. My point wasn't to bad-mouth Spark. I'm not comparing Spark to Storm at all, so I think there's some confusion here; I'm thinking of Esper, StreamBase, and other stream-processing products. My point is to think about the problems that need to be solved before picking a solution. Like everyone else, I've been guilty of this in the past, so it's not propaganda for or against any specific product. I've seen customers use IBM InfoSphere Streams when something like Storm or Spark would work, but I've also seen cases where open source doesn't provide equivalent functionality. If Spark meets the needs, then either HBase or Cassandra will probably work fine. The bigger question is what patterns you use in the architecture: Do you store the data first before doing analysis? Is the data noisy and does it need filtering before persistence? What kinds of patterns/queries and operations are needed? Having worked on trading systems and other real-time use cases, not all stream processing is the same.
On Thu, Dec 18, 2014 at 8:18 AM, Ryan Svihla rsvi...@datastax.com wrote: I'll decline to continue the commentary on Spark, as again this probably belongs on another list, other than to say: micro-batching is an intentional design tradeoff that has notable benefits for the same use cases you're referring to, and while you may disagree with those tradeoffs, it's a bit harsh to dismiss as "basic" something that was chosen deliberately and provides some improvements over, say, the Storm model.

On Thu, Dec 18, 2014 at 7:13 AM, Peter Lin wool...@gmail.com wrote: Some of the most common use cases in stream processing are sliding windows based on time or count. Based on my understanding of the Spark architecture and Spark Streaming, it does not provide the same functionality. One can fake it by setting Spark Streaming to really small micro-batches, but that's not the same. If the use case fits that model, then using Spark is fine. For other kinds of use cases, Spark may not be a good fit. Some people store all events before analyzing them, which works for some use cases, while for other use cases, like trading systems, store-before-analysis isn't feasible or practical. Other use cases like command and control also don't fit the store-before-analysis model. Try to avoid putting the cart in front of the horse: picking a tool before you have a clear understanding of the problem is a good recipe for disaster.

On Thu, Dec 18, 2014 at 8:04 AM, Ryan Svihla rsvi...@datastax.com wrote: Since Ajay is already using Spark, the Spark Cassandra Connector really gets them where they want to be pretty easily: https://github.com/datastax/spark-cassandra-connector (joins, etc.). As far as Spark Streaming having "basic" support, I'd challenge that assertion (namely, Storm has a number of problems with delivery guarantees that Spark basically solves); however, this isn't a Spark mailing list, and perhaps this conversation is better had there. If the question is "Is Cassandra used in real-time analytics cases with Spark?"
the answer is absolutely yes (and Storm, for that matter). If the question is "Can you do your analytics queries on Cassandra while you have Spark sitting there doing nothing?" then of course the answer is no, but that'd be a bizarre question; they already have Spark in use.

On Thu, Dec 18, 2014 at 6:52 AM, Peter Lin wool...@gmail.com wrote: That depends on what you mean by real-time analytics. For things like continuous data streams, neither is an appropriate platform for doing the analytics themselves; they're good for storing the results (aka output) of the streaming analytics. I would suggest that before you decide Cassandra vs. HBase, first figure out exactly what kind of analytics you need to do. Start with prototyping, and look at what kinds of queries and patterns you need to support. Neither HBase nor Cassandra is good for complex patterns that do joins or cross-joins (aka MDX), so using either one you have
Re: Cassandra for Analytics?
Hi Peter, You are right. The idea is to query the data directly from NoSQL, in our case via Spark SQL on Spark (as Spark largely supports Mongo/Cassandra/HBase/Hadoop). As you said, the business users would still need to query using Spark SQL. We are already using NoSQL BI tools like Pentaho (which also plans to support Spark SQL soon). The idea is to abstract the business users from the storage solutions (more than one: Cassandra/HBase/Mongo). Thanks Ajay

On Thu, Dec 18, 2014 at 8:01 PM, Peter Lin wool...@gmail.com wrote: By data warehouse, what kind do you mean? Is it the traditional warehouse where people create multi-dimensional cubes? Or is it the newer class of UI tools that makes it easier for users to explore data, where the warehouse is mostly a denormalized (i.e. flattened) format of the OLTP data? Or is it a combination of both? From my experience, the biggest challenge of data warehousing isn't storing the data; it's making it easy to explore for ad-hoc MDX-like queries. In the old days, the DBAs would define the cubes, write the ETL routines, and let the data load for days/weeks. In the new NoSQL model, you can avoid the cube + ETL phase, but discovering the data and understanding the format still requires a developer. Getting the data into a user-friendly format like a cube with Spark still requires a developer. I find that business users hate to go to the developer, because we tend to ask "what are the functional specs?" Most of the time business users don't know; they just want to explore. At that point, the storage engine largely doesn't matter to the end user. It matters to the developers, but business users don't care. Based on the description, I would watch out for how many aggregated views the platform creates; search the mailing list for past discussions on the maximum recommended number of column families. Where the classic data warehouse caused lots of pain is in creating cubes.
Any general solution attempting to replace/supplement existing products needs to make it easy and trivial to define ad hoc cubes and then query against them. There are existing products that already connect to a few NoSQL databases for data exploration. Hope that helps, Peter

On Thu, Dec 18, 2014 at 9:01 AM, Ajay ajay.ga...@gmail.com wrote: Thanks Ryan and Peter for the suggestions. Our requirement (an e-commerce company) at a higher level is to build a data warehouse as a platform or service (for different product teams to consume), as below:

Datawarehouse as a platform/service
|
Spark SQL
|
Spark in-memory computation engine (we were considering Drill/Flink, but Spark is more mature and in production)
|
Cassandra/HBase (yet to be decided; aggregated views + data directly written to this, so 40-50% writes, 50-60% reads)
|
Stream processing (Spark Streaming or Storm, yet to be decided; Spark Streaming is relatively new)
|
MySQL/Mongo/real-time data

Since we are planning to build it as a service, we cannot assume a particular data access pattern. Thanks Ajay

On Thu, Dec 18, 2014 at 7:00 PM, Peter Lin wool...@gmail.com wrote: For the record, I think Spark is good and I'm glad we have options; my point wasn't to bad-mouth Spark. I'm not comparing Spark to Storm at all, so I think there's some confusion here. I'm thinking of Esper, StreamBase, and other stream processing products. My point is to think about the problems that need to be solved before picking a solution. Like everyone else, I've been guilty of this in the past, so it's not propaganda for or against any specific product. I've seen customers use IBM InfoSphere Streams when something like Storm or Spark would work, but I've also seen cases where open source doesn't provide equivalent functionality. If Spark meets the needs, then either HBase or Cassandra will probably work fine. The bigger question is what patterns do you use in the architecture? Do you store the data first before doing analysis?
Is the data noisy, and does it need filtering before persistence? What kinds of patterns/queries and operations are needed? Having worked on trading systems and other real-time use cases, I can say that not all stream processing is the same.

On Thu, Dec 18, 2014 at 8:18 AM, Ryan Svihla rsvi...@datastax.com wrote: I'll decline to continue the commentary on Spark, as again this probably belongs on another list, other than to say: micro-batching is an intentional design tradeoff that has notable benefits for the same use cases you're referring to, and while you may disagree with those tradeoffs, it's a bit harsh to dismiss as basic something that was chosen deliberately and provides some improvements over, say, the Storm model.

On Thu, Dec 18, 2014 at 7:13 AM, Peter Lin wool...@gmail.com wrote: Some of the most common use cases in stream processing are sliding
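The micro-batch tradeoff Ryan mentions can be illustrated with a toy sketch in plain Java (not Spark or Storm code; the interval width and event times are made up). Per-event processing handles each record as it arrives; micro-batching groups records falling into the same fixed interval and processes the group at once, trading per-event latency for batch throughput:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not Spark Streaming internals): assign timestamped events
// to fixed-width micro-batch intervals, the way a micro-batch engine
// groups a stream before processing each group as one small job.
public class MicroBatchSketch {
    // Bucket each event (timestamp in ms) into a batch of width batchMs.
    static List<List<Long>> microBatch(List<Long> eventTimesMs, long batchMs) {
        List<List<Long>> batches = new ArrayList<>();
        for (long t : eventTimesMs) {
            int idx = (int) (t / batchMs); // which interval this event falls in
            while (batches.size() <= idx) {
                batches.add(new ArrayList<>());
            }
            batches.get(idx).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Events at 100 ms, 250 ms, 900 ms, 1100 ms; 500 ms batch interval.
        List<Long> events = List.of(100L, 250L, 900L, 1100L);
        List<List<Long>> batches = microBatch(events, 500);
        System.out.println(batches.size());        // 3 intervals: [0,500), [500,1000), [1000,1500)
        System.out.println(batches.get(0).size()); // the first batch holds two events
    }
}
```

An event's result is only available once its interval closes, which is exactly the extra latency the per-event (Storm-style) model avoids; the upside is that each batch amortizes scheduling and I/O cost.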
Cassandra for Analytics?
Hi, Can Cassandra be used, or is it a best fit, for real-time analytics? I went through a couple of benchmarks comparing Cassandra vs HBase (most of them done 3 years ago), and they mentioned that Cassandra is designed for intensive writes and has higher read latency than HBase. In our case, we will have both writes and reads, but reads will dominate (say 40% writes and 60% reads). We are planning to use Spark as the in-memory computation engine. Thanks Ajay
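The write-optimized/read-penalized behavior those benchmarks describe comes from Cassandra's log-structured (LSM-style) storage. A heavily simplified plain-Java sketch of the idea (not Cassandra internals; no compaction, bloom filters, or caches) shows why: a write is one log append plus an in-memory update, while a read may have to consult the memtable and every flushed sstable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch (not Cassandra code): an LSM-style store. Writes are cheap
// appends; reads may search several immutable sstables newest-first,
// which is where the extra read latency the benchmarks mention comes from.
public class LsmSketch {
    private final List<String> commitLog = new ArrayList<>();             // sequential appends
    private TreeMap<String, String> memtable = new TreeMap<>();           // in-memory writes
    private final List<TreeMap<String, String>> sstables = new ArrayList<>(); // flushed, immutable

    public void write(String key, String value) {
        commitLog.add(key + "=" + value); // one sequential append: fast path
        memtable.put(key, value);
    }

    public void flush() { // the memtable becomes an immutable sstable
        sstables.add(memtable);
        memtable = new TreeMap<>();
    }

    public String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        // Newest-first search: reads may touch many structures, writes touch one.
        for (int i = sstables.size() - 1; i >= 0; i--) {
            if (sstables.get(i).containsKey(key)) return sstables.get(i).get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        LsmSketch store = new LsmSketch();
        store.write("user:1", "Anna");
        store.flush();
        store.write("user:1", "Anna B."); // newer value shadows the flushed one
        System.out.println(store.read("user:1")); // prints Anna B.
        store.flush();
        System.out.println(store.read("user:1")); // still Anna B., from the newest sstable
    }
}
```

For a 60%-read workload this overhead is not disqualifying (compaction and caching mitigate it in the real system), but it is the structural reason the older benchmarks showed HBase-favorable read latency.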
Spark SQL Vs CQL performance on Cassandra
Hi, To test Spark SQL vs CQL performance on Cassandra, I did the following:

1) Cassandra standalone server (1 server in a cluster)
2) Spark Master and 1 Worker. Both running on a ThinkPad laptop with 4 cores and 8 GB RAM.
3) Written Spark SQL code using the Cassandra-Spark driver (JavaApiDemo.java; run with spark://127.0.0.1:7077 127.0.0.1)
4) Written CQL code using the Cassandra Java driver (CassandraJavaApiDemo.java)

In both cases, I create 1 million rows and query for one.

Observations:
1) It takes less than 10 milliseconds using CQL (SELECT * FROM users WHERE name='Anna')
2) It takes around 0.6 seconds using Spark (either SELECT * FROM users WHERE name='Anna' or javaFunctions(sc).cassandraTable("test", "people", mapRowTo(Person.class)).where("name=?", "Anna"))

Please let me know if I am missing something in the Spark configuration or the Cassandra-Spark driver.

Thanks
Ajay Garg

package com.datastax.demo;

import java.text.SimpleDateFormat;
import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ExecutionInfo;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;

public class CassandraJavaApiDemo {
    private static SimpleDateFormat format = new SimpleDateFormat("HH:mm:ss.SSS");

    public static void main(String[] args) {
        Cluster cluster = null;
        Session session = null;
        try {
            cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            session = cluster.connect();
            session.execute("DROP KEYSPACE IF EXISTS test2");
            session.execute("CREATE KEYSPACE test2 WITH REPLICATION = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE test2.users "
                    + "(id INT, name TEXT, birth_date TIMESTAMP, PRIMARY KEY (id))");
            session.execute("CREATE INDEX people_name_idx2 ON test2.users(name)");
            session = cluster.connect("test2");

            Statement insert = null;
            for (int i = 0; i < 100; i++) {
                insert = QueryBuilder.insertInto("users").value("id", i)
                        .value("name", "Anna" + i)
                        .value("birth_date", new Date());
                session.execute(insert);
            }

            long start = System.currentTimeMillis();
            Statement scan = new SimpleStatement(
                    "SELECT * FROM users WHERE name='Anna0';");
            scan.enableTracing();
            ResultSet results = session.execute(scan);
            for (Row row : results) {
                System.out.format("%d %s\n", row.getInt("id"), row.getString("name"));
            }
            long end = System.currentTimeMillis();
            System.out.println("Time Taken " + (end - start));

            ExecutionInfo executionInfo = results.getExecutionInfo();
            QueryTrace queryTrace = executionInfo.getQueryTrace();
            System.out.printf("%-38s | %-12s | %-10s | %-12s\n",
                    "activity", "timestamp", "source", "source_elapsed");
            System.out.println(
                    "---------------------------------------+--------------+------------+--------------");
            for (QueryTrace.Event event : queryTrace.getEvents()) {
                System.out.printf("%38s | %12s | %10s | %12s\n",
                        event.getDescription(),
                        millis2Date(event.getTimestamp()),
                        event.getSource(), event.getSourceElapsedMicros());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (session != null) {
                session.close();
            }
            if (cluster != null) {
                cluster.close();
            }
        }
    }

    private static Object millis2Date(long timestamp) {
        return format.format(timestamp);
    }
}

package com.datastax.spark.connector.demo;

import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.datastax.spark.connector.japi.CassandraRow;
import com.google.common.base.Objects;
import org.apache.hadoop.util.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.cassandra.CassandraSQLContext;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

/**
 * This Spark application demonstrates how to use Spark Cassandra Connector with
 * Java.
 * <p/>
 * In order to run it, you will need to run Cassandra database, and create the
 * following keyspace, table and secondary index:
 * <p/>
 * <pre>
 * CREATE KEYSPACE test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
 *
 * CREATE TABLE test.people (
 *     id INT,
 *     name TEXT,
 *     birth_date TIMESTAMP,
 *     PRIMARY KEY (id