Re: Repair schedules for new clusters

2016-05-17 Thread Ben Slater
We’ve found with incremental repairs that more frequent repairs are
generally better. Our current standard for incremental repairs is once per
day. I imagine that the exact optimum frequency is dependent on the ratio
of reads to writes in your cluster.

Turning on incremental repairs from the get-go works OK if your data load
is incremental. If you do a big load before your first incremental repair,
then it’s not much different from migrating to incremental repairs, so it’s
worth following the migration procedure to avoid a big impact.
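As an illustration (the keyspace name, log path, and start time below are
placeholders; stagger the start time per node), a daily per-node crontab
entry could look like:

  # assumes nodetool is on cron's PATH; repair flags as discussed in this thread
  0 2 * * * nodetool repair -pr -par -inc my_keyspace >> /var/log/cassandra/repair.log 2>&1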

Cheers
Ben

On Tue, 17 May 2016 at 16:50 Ashic Mahtab  wrote:

> Hi All,
> My previous cassandra clusters had moderate loads, and I'd simply schedule
> full repairs at different times (but on the same day of the week). That
> seemed to work ok, but was redundant. In my current project, I'm going to
> need to care about repair times a lot more, and was wondering what would be
> the best way to go about it. I have a few questions around this:
>
> * This would be a brand new cluster, and as such, was wondering if I could
> simply turn on incremental repair from the get go.
> * I would then run nodetool repair -pr -par -inc once a week on every node
> at (roughly) the same time. I'd do this with a cron job / external
> scheduler.
> * If I were to replace a node, or one rejoins after being absent for
> longer than the grace period, I'd run a full repair on that node.
>
> Does this sound reasonable? Are there any pitfalls I should be aware of?
>
> Thanks,
> Ashic.
>
-- 

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


Re: Why simple replication strategy for system_auth ?

2016-05-17 Thread Jérôme Mainaud
Thank you for your answer.

What I still don't understand is why auth data is not managed in the same
way as schema metadata.
Both must be accessible for the node to do its job. Both are changed very
rarely.
In a way, users are a kind of database object.

I understand the choice for tracing and repair history, but not for
authentication.

I note that the 3.0 documentation suggests 3 to 5 nodes. That was my choice,
but a client told me I was wrong, pointing at the 2.1 documentation...
And it was difficult to explain to experienced classic DBAs that creating a
user and granting rights are so different from creating a table that the
metadata is stored in a different way.
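Concretely, the change under discussion is something like this sketch (the
data center name 'DC1' and RF 3 are just examples; the repair afterwards is
so existing replicas pick up the auth data):

  cqlsh -e "ALTER KEYSPACE system_auth
            WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
  nodetool repair system_auth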



-- 
Jérôme Mainaud
jer...@mainaud.com

2016-05-13 12:13 GMT+02:00 Sam Tunnicliffe :

> LocalStrategy means that data is not replicated in the usual way and
> remains local to each node. Where it is used, replication is either not
> required (for example in the case of secondary indexes and system.local) or
> happens out of band via some other method (as in the case of schema, or
> system.peers which is populated largely from gossip).
>
> There are several components in Cassandra which generate or persist
> "system" data for which a normal distribution makes sense. Auth data is
> one, tracing, repair history and materialized view status are others. The
> keyspaces for this data generally use SimpleStrategy by default as it is
> guaranteed to work out of the box, regardless of topology.  The intent of
> the advice to configure system_auth with RF=N was to increase the
> likelihood that any read of auth data would be done locally, avoiding
> remote requests where possible. This is somewhat outdated though and not
> really necessary. In fact, the 3.x docs actually suggest "3 to 5 nodes per
> Data Center"[1]
>
> FTR, you can't specify LocalStrategy in a CREATE or ALTER KEYSPACE, for
> these reasons.
>
> [1]
> http://docs.datastax.com/en/cassandra/3.x/cassandra/configuration/secureConfigNativeAuth.htm
>
>
> On Fri, May 13, 2016 at 10:47 AM, Jérôme Mainaud 
> wrote:
>
>> Hello,
>>
>> Is there any good reason why system_auth strategy is SimpleStrategy by
>> default instead of LocalStrategy like system and system_schema ?
>>
>> Especially when the documentation advises setting the replication factor to
>> the number of nodes in the cluster, which is both weird and inconvenient to
>> follow.
>>
>> Do you think that changing the strategy to LocalStrategy would work or
>> have undesirable side effects ?
>>
>> Thank you.
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>


Restoring Incremental Backups without using sstableloader

2016-05-17 Thread Ravi Teja A V
Hi everyone

I am currently working with Cassandra 3.5. I would like to know if it is
possible to restore backups without using sstableloader. I have been
referring to the following page in the DataStax documentation:
https://docs.datastax.com/en/cassandra/3.x/cassandra/operations/opsBackupSnapshotRestore.html
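From that page, I understand the manual, node-local procedure to be roughly
the following sketch (keyspace/table names, the table ID, and paths are
placeholders):

  # on each node, for each table: copy the snapshot files back into the
  # table's data directory, then load them without a restart
  cp /var/lib/cassandra/data/my_ks/my_table-<table_id>/snapshots/my_tag/* \
     /var/lib/cassandra/data/my_ks/my_table-<table_id>/
  nodetool refresh my_ks my_table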
Thank you.

Yours sincerely
RAVI TEJA A V


Re: SS Table File Names not containing GUIDs

2016-05-17 Thread Alain RODRIGUEZ
Hi,

> I am wondering if there is any reason as to why the SS Table format doesn’t
> have a GUID


I don't know for sure, but what I can say is that GUIDs are often used to
solve the unique-ID problem in a distributed system. SSTables are stored on a
single node, so an incrementing number works. I would say this worked and was
straightforward, which is probably the reason. Plus, SSTable names / paths are
long enough already; I prefer to see '241' in there than
'c0629566-4a15-4db2-bb97-ee6e083de32b'.

> Specifically, this causes some inconvenience when restoring snapshots.


This is true. That said, in 5 years of using Cassandra I have restored
snapshots maybe twice: once to feed staging (empty, so no issue) and once to
test recovery. So it is not that frequent.

> The problem is it is possible to overwrite new data with old files if the
> file names match. I can’t change the file names of snapshot-ed file to a
> huge number, because as soon as that file is copied over, C* will use that
> number in its get-next-number-gen logic potentially causing the same
> problem for the next snapshot-ed file.


What about using a lower value? Also, if your value is really greater than
the current one, the risk is low, since tables are compacted often enough.
There are several relatively easy and workable workarounds here, I believe. I
don't remember how I solved this myself, though.
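For example, a hypothetical rename sketch (3.x-style file names such as
ma-241-big-Data.db are assumed; collision checks are omitted, so test on a
copy first):

  gen=1
  for data in snapshots/my_tag/ma-*-big-Data.db; do
    old=$(basename "$data" | cut -d- -f2)           # current generation number
    for comp in snapshots/my_tag/ma-"$old"-big-*; do
      mv "$comp" "${comp/ma-$old-big/ma-$gen-big}"  # rename every component file
    done
    gen=$((gen + 1))
  done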

I would say I do not agree that we need to use GUIDs, but that is just my
opinion. If you feel this could be an improvement, search for an existing
ticket about it or file a new one.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-05-02 18:55 GMT+02:00 Anubhav Kale :

> Hello,
>
>
>
> I am wondering if there is any reason as to why the SS Table format
> doesn’t have a GUID. As far as I can tell, the incrementing number isn’t
> really used for any special purpose in code, and having a unique name for
> the file seems to be a better thing, in general.
>
>
>
> Specifically, this causes some inconvenience when restoring snapshots.
> Ideally, I would like to restore just the system* keyspaces and boot the
> node. Then, once the node is taking live traffic copy the SS Tables over
> and do a DSE restart at the end to load old data.
>
>
>
> The problem is it is possible to overwrite new data with old files if the
> file names match. I can’t change the file names of snapshot-ed file to a
> huge number, because as soon as that file is copied over, C* will use that
> number in its get-next-number-gen logic potentially causing the same
> problem for the next snapshot-ed file.
>
>
>
> How do people usually tackle this ? Is there some easy solution that I am
> not seeing ?
>
>
>
> Thanks !
>


Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-17 Thread Eric Evans
On Mon, May 16, 2016 at 5:19 PM, Drew Kutcharian  wrote:
>
> What’s the difference between the two “Community” repositories Apache 
> (http://www.apache.org/dist/cassandra/debian) and DataStax 
> (http://debian.datastax.com/community/)?

Good question.  All I can tell you is that the Apache repository is
the official one (the only official one).

> If they are just mirrors, then it seems like the DataStax one is a bit behind 
> (version 3.0.6 is available on Apache but not on DataStax).
>
> I’ve been using the DataStax community repo and wanted to see if I still 
> should continue using it or switch to the Apache repo.

If it is your intention to run Apache Cassandra, from the Apache
Cassandra project, then you should be using the Apache repo.

-- 
Eric Evans
eev...@apache.org


Re: Bloom filter memory usage disparity

2016-05-17 Thread Alain RODRIGUEZ
Hi, we would need more information here (if you did not solve it yet).

What is your Cassandra version?
Does this 3 node cluster use a Replication Factor of 3?
Did you change the bloom_filter_fp_chance recently?

> That table has about 16M keys and 140GB of data.

Is that the total value or per node? In any case, we need the data size for
the 3 nodes to understand.

It might have been a temporary situation, but in this case you would know
by now.

C*heers,


2016-05-03 18:47 GMT+02:00 Kai Wang :

> Hi,
>
> I have a table on 3-node cluster. I notice bloom filter memory usage are
> very different on one of the node. For a given table, I checked
> CassandraMetricsRegistry$JmxGauge.[table]_BloomFilterOffHeapMemoryUsed.Value.
> 2 of 3 nodes show 1.5GB while the other shows 2.5 GB.
>
> What could be the reason?
>
> That table is using LCS.
> bloom_filter_fp_chance=0.1
> That table has about 16M keys and 140GB of data.
>
> Thanks.
>


Re: MigrationManager.java:164 - Migration task failed to complete

2016-05-17 Thread Alain RODRIGUEZ
There is not much context here, so I will give a standard answer too.

If you have a doubt regarding the data owned by a node, running repair takes
some resources but should never break anything; it is an operation you can
run as often as you want. So I would use it, just in case.

If the repair finishes successfully, your data is now consistent.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-05-03 20:09 GMT+02:00 Zhang, Charles :

> I have seen a bunch of them in the log files of some newly joined nodes. I
> did a search on Google and it seems increasing the countdown latch timeout
> can solve this problem. But I assume it only resolves the issue for nodes
> joining in the future? For the existing nodes, does anything need to be
> done?
>


Re: [C*3.0.3]lucene indexes not deleted and nodetool repair makes DC unavailable

2016-05-17 Thread Siddharth Verma
Hi Eduardo,
Thanks for your reply. If it is fixed in 3.0.5.1, we will shift to it.

One more question: if, instead of truncating the table, we remove some rows,
are the Lucene documents and index entries for those rows deleted?


Nodetool clearsnapshot doesn't support Column Families

2016-05-17 Thread Anubhav Kale
Hello,

I noticed that clearsnapshot doesn't support removing snapshots on a per-CF
basis, the way snapshot lets you take them per CF.

http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsClearSnapShot.html

I couldn't find a JIRA to address this. Is this intentional? If so, I am
curious to understand the rationale.

In the absence of this, I would just rm -rf the folder to suit my
requirements. Are there any bad effects of doing so?
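(For reference, that folder lives under the table's data directory; the
keyspace, table ID, and snapshot tag below are placeholders:)

  rm -rf /var/lib/cassandra/data/my_ks/my_table-<table_id>/snapshots/my_tag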

Thanks !


Re: [C*3.0.3]lucene indexes not deleted and nodetool repair makes DC unavailable

2016-05-17 Thread Andres de la Peña
Hi Siddharth,

Lucene doesn't immediately remove deleted documents from disk. Instead, it
just marks them as deleted, and they are effectively removed during
segments merge. This is quite similar to how C* manages deletions with
tombstones and compactions.

Regards,

2016-05-17 17:30 GMT+01:00 Siddharth Verma :

> Hi Eduardo,
> Thanks for your reply. If it is fixed in 3.0.5.1, we will shift to it.
>
> One more question,
> If instead of truncating table, we remove some rows, then
> are the lucene documents and indexes for those rows deleted?
>



-- 
Andrés de la Peña

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
*


Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-17 Thread Drew Kutcharian
Thanks Eric.


> On May 17, 2016, at 7:50 AM, Eric Evans  wrote:
> 
> On Mon, May 16, 2016 at 5:19 PM, Drew Kutcharian  wrote:
>> 
>> What’s the difference between the two “Community” repositories Apache 
>> (http://www.apache.org/dist/cassandra/debian) and DataStax 
>> (http://debian.datastax.com/community/)?
> 
> Good question.  All I can tell you is that the Apache repository is
> the official one (the only official one).
> 
>> If they are just mirrors, then it seems like the DataStax one is a bit 
>> behind (version 3.0.6 is available on Apache but not on DataStax).
>> 
>> I’ve been using the DataStax community repo and wanted to see if I still 
>> should continue using it or switch to the Apache repo.
> 
> If it is your intention to run Apache Cassandra, from the Apache
> Cassandra project, then you should be using the Apache repo.
> 
> -- 
> Eric Evans
> eev...@apache.org



Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-17 Thread Drew Kutcharian
BTW, the language on this page should probably change, since it currently
sounds like the official repo is the DataStax one and Apache is only an
“alternative”.

http://wiki.apache.org/cassandra/DebianPackaging

- Drew

> On May 17, 2016, at 11:35 AM, Drew Kutcharian  wrote:
> 
> Thanks Eric.
> 
> 
>> On May 17, 2016, at 7:50 AM, Eric Evans  wrote:
>> 
>> On Mon, May 16, 2016 at 5:19 PM, Drew Kutcharian  wrote:
>>> 
>>> What’s the difference between the two “Community” repositories Apache 
>>> (http://www.apache.org/dist/cassandra/debian) and DataStax 
>>> (http://debian.datastax.com/community/)?
>> 
>> Good question.  All I can tell you is that the Apache repository is
>> the official one (the only official one).
>> 
>>> If they are just mirrors, then it seems like the DataStax one is a bit 
>>> behind (version 3.0.6 is available on Apache but not on DataStax).
>>> 
>>> I’ve been using the DataStax community repo and wanted to see if I still 
>>> should continue using it or switch to the Apache repo.
>> 
>> If it is your intention to run Apache Cassandra, from the Apache
>> Cassandra project, then you should be using the Apache repo.
>> 
>> -- 
>> Eric Evans
>> eev...@apache.org
> 



Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-17 Thread Drew Kutcharian
OK, to make things even more confusing, the “Release” files in the Apache
repo say “Origin: Unofficial Cassandra Packages”!!

i.e. http://dl.bintray.com/apache/cassandra/dists/35x/:Release


> On May 17, 2016, at 12:11 PM, Drew Kutcharian  wrote:
> 
> BTW, the language on this page should probably change since it currently 
> sounds like the official repo is the DataStax one and Apache is only an 
> “alternative"
> 
> http://wiki.apache.org/cassandra/DebianPackaging
> 
> - Drew
> 
>> On May 17, 2016, at 11:35 AM, Drew Kutcharian  wrote:
>> 
>> Thanks Eric.
>> 
>> 
>>> On May 17, 2016, at 7:50 AM, Eric Evans  wrote:
>>> 
>>> On Mon, May 16, 2016 at 5:19 PM, Drew Kutcharian  wrote:
 
 What’s the difference between the two “Community” repositories Apache 
 (http://www.apache.org/dist/cassandra/debian) and DataStax 
 (http://debian.datastax.com/community/)?
>>> 
>>> Good question.  All I can tell you is that the Apache repository is
>>> the official one (the only official one).
>>> 
 If they are just mirrors, then it seems like the DataStax one is a bit 
 behind (version 3.0.6 is available on Apache but not on DataStax).
 
 I’ve been using the DataStax community repo and wanted to see if I still 
 should continue using it or switch to the Apache repo.
>>> 
>>> If it is your intention to run Apache Cassandra, from the Apache
>>> Cassandra project, then you should be using the Apache repo.
>>> 
>>> -- 
>>> Eric Evans
>>> eev...@apache.org
>> 
> 



Applying TTL Change quickly

2016-05-17 Thread Anubhav Kale
Hello,

We use STCS and DTCS on our tables and recently made a TTL change (reduced
from 8 days to 2) on a table with a large amount of data. What is the best way
to quickly purge old data? I am playing with tombstone_compaction_interval at
the moment, but would like some suggestions on what else can be done to
reclaim the space as quickly as possible.

Thanks !


Re: Applying TTL Change quickly

2016-05-17 Thread Jeff Jirsa
Fastest way? Stop Cassandra, use sstablemetadata to find any files whose max
timestamp is older than 2 days and remove them, then start Cassandra. This
works better with some compaction strategies than others (you'll probably find
a few droppable SSTables with either DTCS or STCS, but it's not perfect).

Cleanest way? One by one (starting with the oldest SSTables first), run
forceUserDefinedCompaction on each SSTable and let it purge out the droppable
garbage. This is what the tombstone sub-properties would do.
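A rough sketch of the discovery step for the first approach (paths are
placeholders; assumes default microsecond write timestamps and that
sstablemetadata prints a "Maximum timestamp" line):

  cutoff=$(( ( $(date +%s) - 2*24*3600 ) * 1000000 ))  # 2 days ago, in microseconds
  for f in /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db; do
    maxts=$(sstablemetadata "$f" | awk '/Maximum timestamp/ {print $3}')
    [ -n "$maxts" ] && [ "$maxts" -lt "$cutoff" ] && echo "droppable: $f"
  done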




From:  Anubhav Kale
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, May 17, 2016 at 4:17 PM
To:  "user@cassandra.apache.org"
Subject:  Applying TTL Change quickly

Hello,

We use STCS and DTCS on our tables and recently made a TTL change (reduced
from 8 days to 2) on a table with a large amount of data. What is the best way
to quickly purge old data? I am playing with tombstone_compaction_interval at
the moment, but would like some suggestions on what else can be done to
reclaim the space as quickly as possible.

Thanks !





restore cassandra snapshots on a smaller cluster

2016-05-17 Thread Luigi Tagliamonte
Hi everyone,
I'm wondering if it is possible to restore all the snapshots of a cluster
(10 nodes) onto a smaller cluster (3 nodes)? If yes, how would I do it?

-- 
Luigi
---
“The only way to get smarter is by playing a smarter opponent.”


Re: restore cassandra snapshots on a smaller cluster

2016-05-17 Thread Jeff Jirsa
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated



From:  Luigi Tagliamonte
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, May 17, 2016 at 5:35 PM
To:  "user@cassandra.apache.org"
Subject:  restore cassandra snapshots on a smaller cluster

Hi everyone,
i'm wondering if it is possible to restore all the snapshots of a cluster (10 
nodes) in a smaller cluster (3 nodes)? If yes how to do it?

-- 
Luigi
---
“The only way to get smarter is by playing a smarter opponent.”





Re: restore cassandra snapshots on a smaller cluster

2016-05-17 Thread Ben Slater
It should definitely work if you use sstableloader to load all the files. I
imagine it is also possible with a straight restore (copying sstables) if you
assign the tokens from multiple source nodes to one target node using the
initial_token parameter in cassandra.yaml.
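A sketch of the token step (assumes vnodes and 2.x-era nodetool output):

  # on each source node, print its tokens as one comma-separated line:
  nodetool info -T | awk '/^Token/ {print $3}' | paste -sd, -
  # concatenate the lists from the source nodes mapped to a given target node,
  # then set the result as initial_token in that target's cassandra.yaml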

Cheers
Ben

On Wed, 18 May 2016 at 10:35 Luigi Tagliamonte  wrote:

> Hi everyone,
> i'm wondering if it is possible to restore all the snapshots of a cluster
> (10 nodes) in a smaller cluster (3 nodes)? If yes how to do it?
>
> --
> Luigi
> ---
> “The only way to get smarter is by playing a smarter opponent.”
>
-- 

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


Re: Bloom filter memory usage disparity

2016-05-17 Thread Kai Wang
Alain,

Thanks for replying.

I am using C* 2.2.4.
Yes the table is RF=3.
I changed bloom_filter_fp_chance from 0.01 to 0.1 a couple of months ago.


On Tue, May 17, 2016 at 11:05 AM, Alain RODRIGUEZ 
wrote:

> Hi, we would need more information here (if you did not solve it yet).
>
> What is your Cassandra version?
> Does this 3 node cluster use a Replication Factor of 3?
> Did you change the bloom_filter_fp_chance recently?
>
> That table has about 16M keys and 140GB of data.
>>
>
> Is that the total value or per node? In any case, we need the data size
> for the 3 nodes to understand.
>
> It might have been a temporary situation, but in this case you would know
> by now.
>
> C*heers,
>
>
> 2016-05-03 18:47 GMT+02:00 Kai Wang :
>
>> Hi,
>>
>> I have a table on 3-node cluster. I notice bloom filter memory usage are
>> very different on one of the node. For a given table, I checked
>> CassandraMetricsRegistry$JmxGauge.[table]_BloomFilterOffHeapMemoryUsed.Value.
>> 2 of 3 nodes show 1.5GB while the other shows 2.5 GB.
>>
>> What could be the reason?
>>
>> That table is using LCS.
>> bloom_filter_fp_chance=0.1
>> That table has about 16M keys and 140GB of data.
>>
>> Thanks.
>>
>
>


Re: Bloom filter memory usage disparity

2016-05-17 Thread Jeff Jirsa
Even with the same data, bloom filters are built per SSTable. If compaction
behaves differently on 2 nodes than on the third, your bloom filter RAM usage
may be different.
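If you want to check that, something like this on each node should show it
(2.x command name; keyspace/table are placeholders):

  nodetool cfstats my_ks.my_table | egrep 'SSTable count|Bloom filter'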


From:  Kai Wang
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, May 17, 2016 at 8:02 PM
To:  "user@cassandra.apache.org"
Subject:  Re: Bloom filter memory usage disparity

Alain,

Thanks for replying.

I am using C* 2.2.4. 
Yes the table is RF=3. 
I changed bloom_filter_fp_chance from 0.01 to 0.1 a couple of months ago.


On Tue, May 17, 2016 at 11:05 AM, Alain RODRIGUEZ  wrote:
Hi, we would need more information here (if you did not solve it yet). 

What is your Cassandra version?
Does this 3 node cluster use a Replication Factor of 3?
Did you change the bloom_filter_fp_chance recently?

That table has about 16M keys and 140GB of data.

Is that the total value or per node? In any case, we need the data size for the 
3 nodes to understand.

It might have been a temporary situation, but in this case you would know by 
now.

C*heers,


2016-05-03 18:47 GMT+02:00 Kai Wang :
Hi,

I have a table on 3-node cluster. I notice bloom filter memory usage are very 
different on one of the node. For a given table, I checked 
CassandraMetricsRegistry$JmxGauge.[table]_BloomFilterOffHeapMemoryUsed.Value. 2 
of 3 nodes show 1.5GB while the other shows 2.5 GB.

What could be the reason?

That table is using LCS. 
bloom_filter_fp_chance=0.1
That table has about 16M keys and 140GB of data.

Thanks.







Re: Accessing Cassandra data from Spark Shell

2016-05-17 Thread Cassa L
Hi,
I followed the instructions to run the Spark shell with Spark 1.6, and it
works fine. However, I need to use Spark 1.5.2, and with that version it does
not work; I keep getting NoSuchMethodError exceptions. Is there any issue
running the Spark shell against Cassandra with an older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
wrote:

> Yes, it is very simple to access Cassandra data using Spark shell.
>
>
>
> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>
> $SPARK_HOME/bin/spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>
>
>
> Step 2: Create a DataFrame pointing to your Cassandra table
>
> val dfCassTable = sqlContext.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>   .load()
>
>
>
> From this point onward, you have complete access to the DataFrame API. You
> can even register it as a temporary table, if you would prefer to use
> SQL/HiveQL.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Monday, May 9, 2016 9:28 PM
> *To:* user@cassandra.apache.org; user
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> You can use SparkShell to access Cassandra via the Spark Cassandra
> connector. The getting started article on our support page will probably
> give you a good steer to get started even if you’re not using Instaclustr:
> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>
>
>
> Cheers
>
> Ben
>
>
>
> On Tue, 10 May 2016 at 14:08 Cassa L  wrote:
>
> Hi,
>
> Has anyone tried accessing Cassandra data using SparkShell? How do you do
> it? Can you use HiveContext for Cassandra data? I'm using community version
> of Cassandra-3.0
>
>
>
> Thanks,
>
> LCassa
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>


About the data structure of partition index

2016-05-17 Thread Hiroyuki Yamada
Hi,

I am wondering how many primary keys are stored in one partition index.

As the documentation says,
I understand that each partition index has a list of primary keys and the
start position in the compression offset map, so I assume the logical data
structure of a partition index would be like the following:

| [pkey1 .. pkeyN] | offset into the compression offset map |
(indexed by the first column, to allow retrieval by partition key)

I am wondering if this is a correct understanding, and how many primary keys
are stored in the first column.

If it is not correct, would anyone give me the correct logical data structure?

Thanks,
Hiro


Re: Accessing Cassandra data from Spark Shell

2016-05-17 Thread Ben Slater
It definitely should be possible for 1.5.2 (I have used it with spark-shell
and cassandra connector with 1.4.x). The main trick is in lining up all the
versions and building an appropriate connector jar.
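For example, pairing Spark 1.5.2 with the 1.5.x connector line (the exact
artifact version here is an assumption; check the connector's version
compatibility table):

  $SPARK_HOME/bin/spark-shell --packages \
    com.datastax.spark:spark-cassandra-connector_2.10:1.5.0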

Cheers
Ben

On Wed, 18 May 2016 at 15:40 Cassa L  wrote:

> Hi,
> I followed instructions to run SparkShell with Spark-1.6. It works fine.
> However, I need to use spark-1.5.2 version. With it, it does not work. I
> keep getting NoSuchMethod Errors. Is there any issue running Spark Shell
> for Cassandra using older version of Spark?
>
>
> Regards,
> LCassa
>
> On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
> wrote:
>
>> Yes, it is very simple to access Cassandra data using Spark shell.
>>
>>
>>
>> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>>
>> $SPARK_HOME/bin/spark-shell --packages
>> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>>
>>
>>
>> Step 2: Create a DataFrame pointing to your Cassandra table
>>
>> val dfCassTable = sqlContext.read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>>   .load()
>>
>>
>>
>> From this point onward, you have complete access to the DataFrame API.
>> You can even register it as a temporary table, if you would prefer to use
>> SQL/HiveQL.
>>
>>
>>
>> Mohammed
>>
>> Author: Big Data Analytics with Spark
>> 
>>
>>
>>
>> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
>> *Sent:* Monday, May 9, 2016 9:28 PM
>> *To:* user@cassandra.apache.org; user
>> *Subject:* Re: Accessing Cassandra data from Spark Shell
>>
>>
>>
>> You can use SparkShell to access Cassandra via the Spark Cassandra
>> connector. The getting started article on our support page will probably
>> give you a good steer to get started even if you’re not using Instaclustr:
>> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>>
>>
>>
>> Cheers
>>
>> Ben
>>
>>
>>
>> On Tue, 10 May 2016 at 14:08 Cassa L  wrote:
>>
>> Hi,
>>
>> Has anyone tried accessing Cassandra data using SparkShell? How do you do
>> it? Can you use HiveContext for Cassandra data? I'm using community version
>> of Cassandra-3.0
>>
>>
>>
>> Thanks,
>>
>> LCassa
>>
>> --
>>
>> 
>>
>> Ben Slater
>>
>> Chief Product Officer, Instaclustr
>>
>> +61 437 929 798
>>
>
> --

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


Setting bloom_filter_fp_chance < 0.01

2016-05-17 Thread Adarsh Kumar
Hi,

What is the impact of setting bloom_filter_fp_chance < 0.01?

During performance tuning I was trying to tune bloom_filter_fp_chance, and I
have the following questions:

1) Why is bloom_filter_fp_chance = 0 not allowed?
   (https://issues.apache.org/jira/browse/CASSANDRA-5013)
2) What is the maximum/recommended value of bloom_filter_fp_chance, if we do
   not have any limitation on bloom filter size? (See the rough sizing math
   below.)
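For rough intuition on both questions, standard Bloom filter sizing math (not
Cassandra-specific) gives:

  bits per key ≈ ln(1/p) / (ln 2)^2

  p = 0.1    ->  ~4.8 bits/key
  p = 0.01   ->  ~9.6 bits/key
  p = 0.001  ->  ~14.4 bits/key

So the filter grows without bound as p approaches 0 (one reason 0 is rejected
outright), while raising p shrinks memory at the cost of extra wasted disk
reads on false positives.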

NOTE: We are using the default SizeTieredCompactionStrategy on Cassandra
2.1.8.621.

Thanks in advance..:)

Adarsh Kumar