Using K8s to Manage Cassandra in Production

2018-05-17 Thread Hassaan Pasha
Hi,

I am trying to craft a strategy for deploying and maintaining a C*
cluster. I was wondering if there are actual production deployments of C*
using K8s as the orchestration layer.

I have been given the impression that K8s managing a C* cluster can be a
recipe for disaster, especially if you aren't well versed in the
intricacies of scale-up/down events. I know of use cases where people are
using Mesos or custom tooling built with Terraform/Chef etc. to run their
production clusters, but I have yet to find a real K8s use case.

*Questions*
Is K8s a reasonable choice for managing a production C* cluster?
Are there documented use cases for this?

Any help would be greatly appreciated.

-- 
Regards,


*Hassaan Pasha*


Re: Invalid metadata has been detected for role

2018-05-17 Thread kurt greaves
Can you post the stack trace and your version of Cassandra?

On Fri., 18 May 2018, 09:48 Abdul Patel wrote:

> Hi
>
> I had to decommission one DC (I used nodetool decommission). Now, while
> adding back the same nodes, they both get added fine and I also see them
> in nodetool status, but I am unable to log in to them; it gives an
> "invalid metadata" error. I ran repair and later cleanup as well.
>
> Any ideas?
>
>


Invalid metadata has been detected for role

2018-05-17 Thread Abdul Patel
Hi

I had to decommission one DC (I used nodetool decommission). Now, while
adding back the same nodes, they both get added fine and I also see them
in nodetool status, but I am unable to log in to them; it gives an
"invalid metadata" error. I ran repair and later cleanup as well.

Any ideas?


Re: Cassandra java driver linux encoding issue

2018-05-17 Thread Eric Stevens
What is the value returned from Charset.defaultCharset() on both systems?
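
For reference, a minimal standalone check (a sketch, not code from the
original post) prints what the JVM will use. On Linux this often comes
back as US-ASCII/ANSI_X3.4-1968 when no locale is set, which is exactly
the kind of configuration that renders non-ASCII text as ??:

    import java.nio.charset.Charset;

    public class CharsetCheck {
        public static void main(String[] args) {
            // The charset the JVM uses by default for byte<->char conversion.
            System.out.println("Default charset: " + Charset.defaultCharset());
            // The system property it is normally derived from.
            System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        }
    }

If this reports something other than UTF-8, starting the JVM with
-Dfile.encoding=UTF-8 (or exporting LANG=en_US.UTF-8 before launching it)
is a common fix.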

On Wed, May 16, 2018 at 5:00 AM rami dabbah wrote:

> Hi,
>
> I am trying to query a text field from Cassandra using the Java driver;
> see the code below. On Windows it works fine, but on Linux I am getting
> ?? instead of Chinese characters.
>
>
> Code:
>
> ResultSet shopsRS = this.cassandraDAO.getshopsFromScanRawByScanId(
>         cassandraSession, "scan_raw", scanid);
> String record = null;
> for (Row row : shopsRS) {
>     try {
>         // Log the JVM default charset for debugging the encoding issue.
>         pProtocol.addEvent(new BaseEvent(BaseEvent.LEVEL_ERROR,
>                 "Charset.defaultCharset():" + Charset.defaultCharset()));
>         record = row.getString("raw_data");
>         Helper.verifyEncoding(record);
>         String updated_record = Helper.addAttributeToJsonString(pProtocol,
>                 record, CommonVars.AUX_DATA, CommonVars.AUX_DATA_BATCH_ID,
>                 batchId);
>         Helper.verifyEncoding(updated_record);
>         producer.sendMessage(updated_record);
>         counter++;
>     } catch (IOException e) {
>         pProtocol.addEvent(new BaseEvent(BaseEvent.LEVEL_ERROR,
>                 "Could not send message: "));
>         e.printStackTrace();
>     }
> }
>
>
>
> example text:
>
> "details_product_name":"佛罗伦萨万豪AC酒店(AC Hotel Firenze)|"
>
>
> --
> Rami Dabbah
> Java Professional.
>


Re: Interesting Results - Cassandra Benchmarks over Time Series Data for IoT Use Case I

2018-05-17 Thread Ben Slater
Approach (4) or (5) is what I would go for - as your results show, they
are basically identical, since the composite partition key gets converted
into a single hash.

Looking at your doc, I think the issue is that you are using < operators
on the day field. As Cassandra doesn't natively do range queries on a
hash, SparkSQL is being clever and iterating through all the data to find
the partitions that match your selection criteria.

The approach we have found necessary to get good performance is to
provide the actual list of days you are interested in, which should allow
the conditions to be fully pushed down from Spark to Cassandra (although
this can be a little hard to control with SparkSQL). I gave a talk at
Cassandra Summit a couple of years ago on our approach to a very similar
problem. You can find the slides, including some code snippets, here:
https://www.slideshare.net/Instaclustr/instaclustr-webinar-5-transactions-per-second-with-apache-spark-on-apache-cassandra
and I think the video is still on YouTube. There is also some updated
description and code in this blog post:
https://www.instaclustr.com/upgrading-instametrics-to-cassandra-3/. This
one is a bit high level, but you might also find it relevant:
https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
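
To make the push-down behaviour concrete, here is a rough sketch of the
pattern (keyspace, table, and column names are assumed for illustration,
not taken from the talk):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PushdownExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("pushdown-example").getOrCreate();

            // Assumes the spark-cassandra-connector is on the classpath
            // and a table like: PRIMARY KEY ((day, dev_id), rec_time).
            Dataset<Row> readings = spark.read()
                    .format("org.apache.spark.sql.cassandra")
                    .option("keyspace", "iot")
                    .option("table", "readings")
                    .load();

            // Enumerating the days pins the whole partition key with =/IN,
            // so the connector can push the predicate down to Cassandra.
            readings.where("day IN ('2018-05-10', '2018-05-11') " +
                    "AND dev_id = 42").show();

            // A range such as day < '2018-05-12' cannot be pushed down;
            // Spark would scan the table and filter afterwards.
        }
    }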

Cheers
Ben

On Thu, 17 May 2018 at 18:06 Arbab Khalil wrote:

> We have been exploring IoT specific C* schema design over the past few
> months. We wanted to share the benchmarking results with the wider
> community for a) bringing rigor to the discussion, and b) starting a
> discussion for better design.
>
> First the use case: we have time series of data from devices on several
> sites, where each device (with a unique dev_id) can have several sensors
> attached to it. Most queries, however, are both time-limited and over a
> range of dev_ids, even for a single sensor (multi-sensor joins are a
> whole different beast for another day!). We want a schema where a query
> can complete in time linear in the query ranges for both devices and
> time, largely independent of the total data size.
>
>
> So we explored several different primary key definitions, learning from
> the best practices communicated on this mailing list and over the
> interwebs. While details about the setup (Spark over C*) and schema are
> in a companion blog/site here [1], we just mention the primary keys and
> the key points here.
>
>
>    1. PRIMARY KEY (dev_id, day, rec_time)
>    2. PRIMARY KEY ((dev_id, rec_time))
>    3. PRIMARY KEY (day, dev_id, rec_time)
>    4. PRIMARY KEY ((day, dev_id), rec_time)
>    5. PRIMARY KEY ((dev_id, day), rec_time)
>    6. Combinations of the above, adding a year field to the schema.
>
>
> The main takeaway (again, please read through the details at [1]) is that
> we really don't have a single schema that answers the use case above
> without some drawback. While ((day, dev_id), rec_time) gives a response
> time that is constant in the query range, that time depends entirely on
> the total data size (a full scan). On the other hand, while (dev_id, day,
> rec_time) and its counterpart (day, dev_id, rec_time) provide acceptable
> results, we have the issue of very large partitions with the first and
> write hotspots with the latter.
>
> We also observed that a multi-field partition key allows fast querying
> only if "=" is used going left to right. If an IN() (for specifying e.g.
> a range of time or a list of devices) is used once in that order, then
> any further usage of IN() removes any benefit (i.e. a near full table
> scan).
> Another useful learning was that using IN() to query for days is less
> useful than a range query.
>
> Currently, it seems we are in a bind --- should we use a different data
> store for our use case (which seems quite typical for IoT)? Something
> like HDFS or Parquet? We would love to get feedback on the benchmarking
> results and how we can possibly improve this and share widely.
>
> [1] Cassandra Benchmarks over Time Series Data for IoT Use Case:
>    https://sites.google.com/an10.io/timeseries-results
>
>
> --
> Regards,
> Arbab Khalil
> Software Design Engineer
>
-- 


*Ben Slater*

*Chief Product Officer*

Read our latest technical blog posts here.



Interesting Results - Cassandra Benchmarks over Time Series Data for IoT Use Case I

2018-05-17 Thread Arbab Khalil
We have been exploring IoT specific C* schema design over the past few
months. We wanted to share the benchmarking results with the wider
community for a) bringing rigor to the discussion, and b) starting a
discussion for better design.

First the use case: we have time series of data from devices on several
sites, where each device (with a unique dev_id) can have several sensors
attached to it. Most queries, however, are both time-limited and over a
range of dev_ids, even for a single sensor (multi-sensor joins are a
whole different beast for another day!). We want a schema where a query
can complete in time linear in the query ranges for both devices and
time, largely independent of the total data size.


So we explored several different primary key definitions, learning from
the best practices communicated on this mailing list and over the
interwebs. While details about the setup (Spark over C*) and schema are
in a companion blog/site here [1], we just mention the primary keys and
the key points here.


   1. PRIMARY KEY (dev_id, day, rec_time)
   2. PRIMARY KEY ((dev_id, rec_time))
   3. PRIMARY KEY (day, dev_id, rec_time)
   4. PRIMARY KEY ((day, dev_id), rec_time)
   5. PRIMARY KEY ((dev_id, day), rec_time)
   6. Combinations of the above, adding a year field to the schema.
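
For concreteness, option (5) written out as a full table (the non-key
column, keyspace, and table names are assumed, not from the benchmark),
using the same Java driver as elsewhere in this thread:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CreateReadings {
        public static void main(String[] args) {
            // Assumes a local node and an existing keyspace named "iot".
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // (dev_id, day) is the composite partition key: one partition
            // per device per day. rec_time orders rows within a partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS iot.readings (" +
                "  dev_id   int," +
                "  day      date," +
                "  rec_time timestamp," +
                "  value    double," +
                "  PRIMARY KEY ((dev_id, day), rec_time))");

            cluster.close();
        }
    }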


The main takeaway (again, please read through the details at [1]) is that
we really don't have a single schema that answers the use case above
without some drawback. While ((day, dev_id), rec_time) gives a response
time that is constant in the query range, that time depends entirely on
the total data size (a full scan). On the other hand, while (dev_id, day,
rec_time) and its counterpart (day, dev_id, rec_time) provide acceptable
results, we have the issue of very large partitions with the first and
write hotspots with the latter.

We also observed that a multi-field partition key allows fast querying
only if "=" is used going left to right. If an IN() (for specifying e.g.
a range of time or a list of devices) is used once in that order, then
any further usage of IN() removes any benefit (i.e. a near full table
scan).
Another useful learning was that using IN() to query for days is less
useful than a range query.
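
To make the left-to-right rule concrete, here is a hedged sketch with the
Java driver against a ((day, dev_id), rec_time) layout (keyspace and
table names assumed, as above):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class QueryShapes {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();  // assumed host
            Session session = cluster.connect();

            // Fast: the full partition key is pinned with = / IN, and the
            // range falls only on the clustering column rec_time.
            ResultSet rs = session.execute(
                "SELECT * FROM iot.readings " +
                "WHERE day = '2018-05-10' AND dev_id IN (1, 2, 3) " +
                "AND rec_time >= '2018-05-10 00:00:00' " +
                "AND rec_time <  '2018-05-10 06:00:00'");

            // Slow: a range on a partition-key column cannot use the token
            // index, so Cassandra rejects it outright (or, with ALLOW
            // FILTERING, degenerates to a scan):
            //   SELECT * FROM iot.readings WHERE day < '2018-05-10'

            cluster.close();
        }
    }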

Currently, it seems we are in a bind --- should we use a different data
store for our use case (which seems quite typical for IoT)? Something
like HDFS or Parquet? We would love to get feedback on the benchmarking
results and how we can possibly improve this and share widely.

[1] Cassandra Benchmarks over Time Series Data for IoT Use Case:
   https://sites.google.com/an10.io/timeseries-results


-- 
Regards,
Arbab Khalil
Software Design Engineer