Cassandra terminates with OutOfMemory (OOM) error

2013-06-21 Thread Mohammed Guller
We have a 3-node cassandra cluster on AWS. These nodes are running cassandra 
1.2.2 and have 8GB memory. We didn't change any of the default heap or GC 
settings. So each node is allocating 1.8GB of heap space. The rows are wide; 
each row stores around 260,000 columns. We are reading the data using Astyanax. 
If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error. Here is the error message:

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
at 
org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
at 
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at 
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
at org.apache.cassandra.db.Table.getRow(Table.java:355)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
   at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

ERROR 02:14:05,351 Exception in thread Thread[Thrift:6,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.lang.Long.toString(Long.java:269)
at java.lang.Long.toString(Long.java:764)
at 
org.apache.cassandra.dht.Murmur3Partitioner$1.toString(Murmur3Partitioner.java:171)
at 
org.apache.cassandra.service.StorageService.describeRing(StorageService.java:1068)
at 
org.apache.cassandra.thrift.CassandraServer.describe_ring(CassandraServer.java:1192)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3766)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3754)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

The data in each column is less than 50 bytes. After adding all the column 
overheads (column name + metadata), it should not be more than 100 bytes. So 
reading 80,000 columns from each of 10 rows means that we are reading 80,000 * 10 
* 100 = 80 MB of data. That is large, but not large enough to fill up the 1.8 GB 
heap, so I wonder why the heap is getting full. If the data request is too big 
to serve in a reasonable amount of time, I would expect Cassandra to return a 
TimeOutException instead of terminating.

One easy solution is to increase the heap size. However, that means Cassandra can 
still crash if someone reads 100 rows. I wonder if there is some other Cassandra 
setting that I can tweak to prevent the OOM exception?
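
For reference, a minimal sketch of pinning the heap explicitly in conf/cassandra-env.sh 
instead of relying on the auto-calculated value (the numbers below are only an example 
for an 8GB node, not a recommendation):

# conf/cassandra-env.sh -- uncomment and set both together; leaving both unset
# keeps the auto-calculated heap (about 1.8GB on these 8GB nodes)
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="400M"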

Thanks,
Mohammed


Re: Cassandra terminates with OutOfMemory (OOM) error

2013-06-24 Thread Mohammed Guller
No deletes. In my test, I am just writing and reading data.

There is a lot of GC, but only on the young generation. Cassandra terminates 
before the GC for the old generation kicks in.

I know that our queries are reading an unusual amount of data. However, I 
expected it to throw a timeout exception instead of crashing. Also, I don't 
understand why the 1.8 GB heap is getting full when the total data stored in the 
entire Cassandra cluster is less than 55 MB.

Mohammed

On Jun 21, 2013, at 7:30 PM, "sankalp kohli" 
<kohlisank...@gmail.com> wrote:

Looks like you are putting a lot of pressure on the heap by doing a slice query 
on a large row.
Do you have a lot of deletes/tombstones on the rows? That might be causing a 
problem.
Also, why are you returning so many columns at once? You can use the auto-paginate 
feature in Astyanax.

Also, do you see a lot of GC happening?
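
For reference, a minimal sketch of the Astyanax auto-pagination mentioned above; the 
column family, serializers and row key are made-up placeholders, and it assumes an 
already-initialized Keyspace inside a method that handles ConnectionException:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.query.RowQuery;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.util.RangeBuilder;

ColumnFamily<String, String> CF_DATA =
    new ColumnFamily<String, String>("my_cf", StringSerializer.get(), StringSerializer.get());

RowQuery<String, String> query = keyspace
    .prepareQuery(CF_DATA)
    .getKey("some-row-key")
    .autoPaginate(true)
    .withColumnRange(new RangeBuilder().setLimit(500).build());  // 500 columns per page

ColumnList<String> page;
while (!(page = query.execute().getResult()).isEmpty()) {
    // process up to 500 columns here, then loop to fetch the next page
}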


On Fri, Jun 21, 2013 at 1:13 PM, Jabbar Azam 
<aja...@gmail.com> wrote:
Hello Mohammed,

You should increase the heap space. You should also tune the garbage collection 
so young generation objects are collected faster, relieving pressure on the heap. We 
have been using JDK 7, and it uses G1 as the default collector. It does a better 
job than me trying to optimise the JDK 6 GC collectors.
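
A sketch of where such a GC experiment would go, assuming the standard startup script; 
Cassandra 1.2 ships tuned CMS settings by default, so this is illustrative only, not a 
tested recommendation:

# conf/cassandra-env.sh
# remove the default -XX:+UseParNewGC / -XX:+UseConcMarkSweepGC lines first
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"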

Bear in mind though that the OS will need memory, as will the row cache and the 
file system cache, although memory usage will depend on the workload of your system.

I'm sure you'll also get good advice from other members of the mailing list.

Thanks

Jabbar Azam


On 21 June 2013 18:49, Mohammed Guller 
<moham...@glassbeam.com> wrote:
We have a 3-node cassandra cluster on AWS. These nodes are running cassandra 
1.2.2 and have 8GB memory. We didn't change any of the default heap or GC 
settings. So each node is allocating 1.8GB of heap space. The rows are wide; 
each row stores around 260,000 columns. We are reading the data using Astyanax. 
If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error. Here is the error message:

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
at 
org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
at 
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at 
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
at org.apache.cassandra.db.Table.getRow(Table.java:355)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
   at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

ERROR 02:14:05,351 Exception in thread Thread[Thrift:6,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.lang.Long.toString(Long.java:269)
at java.lang.Long.toString(Long.java:764)
at 
org.apache.cassandra.dht.Murmur3Partitioner$1.toString(Murmur3Partitioner.java:171)
at 
org.apache.cassandra.service.StorageService.describeRing(StorageService.java:1068)
at 
org.apache.cassandra.thrift.CassandraServer.describe_ring(CassandraServer.java:1192)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3766)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3754)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at 
org.apache.cassandra.

Re: Cassandra terminates with OutOfMemory (OOM) error

2013-06-25 Thread Mohammed Guller
Replication is 3 and the read consistency level is ONE. One of the non-coordinator 
nodes is crashing, so the OOM is happening before aggregation of the data to be 
returned.

Thanks for the info about the space allocated to young generation heap. That is 
helpful.

Mohammed

On Jun 25, 2013, at 1:28 PM, "sankalp kohli" 
<kohlisank...@gmail.com> wrote:

Your young gen is 1/4 of 1.8G, which is 450MB. Also, in slice queries, the 
coordinator will get the results from replicas as per the consistency level used 
and merge the results before returning them to the client.
What is the replication factor of your keyspace, and what consistency level are you 
reading with?
Also, 55MB on disk will not mean 55MB in memory. The data is compressed on disk, 
and there are also other overheads.



On Mon, Jun 24, 2013 at 8:38 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
No deletes. In my test, I am just writing and reading data.

There is a lot of GC, but only on the younger generation. Cassandra terminates 
before the GC for old generation kicks in.

I know that our queries are reading an unusual amount of data. However, I 
expected it to throw a timeout exception instead of crashing. Also, don't 
understand why 1.8 Gb heap is getting full when the total data stored in the 
entire Cassandra cluster is less than 55 MB.

Mohammed

On Jun 21, 2013, at 7:30 PM, "sankalp kohli" 
<kohlisank...@gmail.com> wrote:

Looks like you are putting lot of pressure on the heap by doing a slice query 
on a large row.
Do you have lot of deletes/tombstone on the rows? That might be causing a 
problem.
Also why are you returning so many columns as once, you can use auto paginate 
feature in Astyanax.

Also do you see lot of GC happening?


On Fri, Jun 21, 2013 at 1:13 PM, Jabbar Azam 
<aja...@gmail.com> wrote:
Hello Mohammed,

You should increase the heap space. You should also tune the garbage collection 
so young generation objects are collected faster, relieving pressure on heap We 
have been using jdk 7 and it uses G1 as the default collector. It does a better 
job than me trying to optimise the JDK 6 GC collectors.

Bear in mind though that the OS will need memory, so will the row cache and the 
filing system. Although memory usage will depend on the workload of your system.

I'm sure you'll also get good advice from other members of the mailing list.

Thanks

Jabbar Azam


On 21 June 2013 18:49, Mohammed Guller 
<moham...@glassbeam.com> wrote:
We have a 3-node cassandra cluster on AWS. These nodes are running cassandra 
1.2.2 and have 8GB memory. We didn't change any of the default heap or GC 
settings. So each node is allocating 1.8GB of heap space. The rows are wide; 
each row stores around 260,000 columns. We are reading the data using Astyanax. 
If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error. Here is the error message:

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
at 
org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
at 
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at 
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
at org.apache.cassandra.db.Table.getRow(Table.java:355)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
   at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.ja

Re: Cassandra terminates with OutOfMemory (OOM) error

2013-06-30 Thread Mohammed Guller
Yes, it is one read request.

Since Cassandra does not support GROUP BY, I was trying to implement it in our 
application, hence the need to read a large amount of data. I guess that was a 
bad idea.
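
For what it's worth, the client-side GROUP BY can be folded into a paged read so the 
whole row never has to sit in memory at once. A rough sketch, reusing the 
auto-paginating Astyanax RowQuery ("query") from the sketch under the 2013-06-24 
message above; extractGroupKey is a hypothetical helper that derives the grouping key 
from a column name:

import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnList;
import java.util.HashMap;
import java.util.Map;

Map<String, Long> counts = new HashMap<String, Long>();
ColumnList<String> page;
while (!(page = query.execute().getResult()).isEmpty()) {  // a few hundred columns per page
    for (Column<String> col : page) {
        String group = extractGroupKey(col.getName());     // hypothetical helper
        Long current = counts.get(group);
        counts.put(group, current == null ? 1L : current + 1L);
    }
}
// "counts" now holds the per-group totals without ever holding all 260,000 columns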

Mohammed

On Jun 27, 2013, at 9:54 PM, "aaron morton" 
<aa...@thelastpickle.com> wrote:

If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error.
Is this in one read request?

Reading 80K columns is too many, try reading a few hundred at most.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/06/2013, at 3:57 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:

Replication is 3 and the read consistency level is ONE. One of the non-coordinator 
nodes is crashing, so the OOM is happening before aggregation of the data to be 
returned.

Thanks for the info about the space allocated to young generation heap. That is 
helpful.

Mohammed

On Jun 25, 2013, at 1:28 PM, "sankalp kohli" 
<kohlisank...@gmail.com> wrote:

Your young gen is 1/4 of 1.8G which is 450MB. Also in slice queries, the 
co-ordinator will get the results from replicas as per consistency level used 
and merge the results before returning to the client.
What is the replication in your keyspace and what consistency you are reading 
with.
Also 55MB on disk will not mean 55MB in memory. The data is compressed on disk 
and also there are other overheads.



On Mon, Jun 24, 2013 at 8:38 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
No deletes. In my test, I am just writing and reading data.

There is a lot of GC, but only on the younger generation. Cassandra terminates 
before the GC for old generation kicks in.

I know that our queries are reading an unusual amount of data. However, I 
expected it to throw a timeout exception instead of crashing. Also, don't 
understand why 1.8 Gb heap is getting full when the total data stored in the 
entire Cassandra cluster is less than 55 MB.

Mohammed

On Jun 21, 2013, at 7:30 PM, "sankalp kohli" 
<kohlisank...@gmail.com> wrote:

Looks like you are putting lot of pressure on the heap by doing a slice query 
on a large row.
Do you have lot of deletes/tombstone on the rows? That might be causing a 
problem.
Also why are you returning so many columns as once, you can use auto paginate 
feature in Astyanax.

Also do you see lot of GC happening?


On Fri, Jun 21, 2013 at 1:13 PM, Jabbar Azam 
<aja...@gmail.com> wrote:
Hello Mohammed,

You should increase the heap space. You should also tune the garbage collection 
so young generation objects are collected faster, relieving pressure on heap We 
have been using jdk 7 and it uses G1 as the default collector. It does a better 
job than me trying to optimise the JDK 6 GC collectors.

Bear in mind though that the OS will need memory, so will the row cache and the 
filing system. Although memory usage will depend on the workload of your system.

I'm sure you'll also get good advice from other members of the mailing list.

Thanks

Jabbar Azam


On 21 June 2013 18:49, Mohammed Guller 
<moham...@glassbeam.com> wrote:
We have a 3-node cassandra cluster on AWS. These nodes are running cassandra 
1.2.2 and have 8GB memory. We didn't change any of the default heap or GC 
settings. So each node is allocating 1.8GB of heap space. The rows are wide; 
each row stores around 260,000 columns. We are reading the data using Astyanax. 
If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error. Here is the error message:

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
at 
org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
at 
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at 
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.a

RE: Accessing Cassandra data from Spark Shell

2016-05-10 Thread Mohammed Guller
Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.
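
For example (the temporary table name is arbitrary):

dfCassTable.registerTempTable("cass_table")
sqlContext.sql("SELECT count(*) FROM cass_table").show()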

Mohammed
Author: Big Data Analytics with 
Spark

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Monday, May 9, 2016 9:28 PM
To: user@cassandra.apache.org; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
<lcas...@gmail.com> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
As Ben mentioned, Spark 1.5.2 does work with C*.  Make sure that you are using 
the correct version of the Spark Cassandra Connector.


Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Tuesday, May 17, 2016 11:00 PM
To: user@cassandra.apache.org; Mohammed Guller
Cc: user
Subject: Re: Accessing Cassandra data from Spark Shell

It definitely should be possible for 1.5.2 (I have used it with spark-shell and 
cassandra connector with 1.4.x). The main trick is in lining up all the 
versions and building an appropriate connector jar.

Cheers
Ben

On Wed, 18 May 2016 at 15:40 Cassa L 
<lcas...@gmail.com> wrote:
Hi,
I followed the instructions to run the Spark shell with Spark 1.6. It works fine. 
However, I need to use Spark 1.5.2. With that version it does not work; I keep 
getting NoSuchMethodError. Is there any issue running the Spark shell for 
Cassandra with an older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ben Slater 
[mailto:ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>]
Sent: Monday, May 9, 2016 9:28 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
<lcas...@gmail.com> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798

--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


RE: select many rows one time or select many times?

2014-08-01 Thread Mohammed Guller
Did you benchmark these two options:

1)  Select with IN

2)  Select all words and filter in application

Mohammed

From: Philo Yang [mailto:ud1...@gmail.com]
Sent: Thursday, July 31, 2014 10:45 AM
To: user@cassandra.apache.org
Subject: select many rows one time or select many times?

Hi all,

I have a cluster of 2.0.6 and one of my tables is like this:
CREATE TABLE word (
  user text,
  word text,
  flag double,
  PRIMARY KEY (user, word)
)

each "user" has about 1 "word" per node. I have a requirement of selecting 
all rows where user='someuser' and word is in a large set whose size is about 
1000 .

In the C* documentation, it is not recommended to use "select ... in" queries like:

select from word where user='someuser' and word in ('a','b','aa','ab',...)

So now I select all rows where user='someuser' and filter them in the client 
rather than in C*. Of course, I use the Datastax Java Driver to page the resultset 
with setFetchSize(1000). Is that the best way? I found that the system's load is high 
because of the large range query; should I change to selecting only one row each 
time, with 1000 separate selects?

just like:
select from word where user='someuser' and word = 'a';
select from word where user='someuser' and word = 'b';
select from word where user='someuser' and word = 'c';
.

Which method will cause lower pressure on Cassandra cluster?
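
For what it's worth, a rough sketch of the "select 1000 times" variant with the 
DataStax Java driver, using a prepared statement and asynchronous execution; the 
contact point, keyspace name and word list are placeholders, and with ~1000 words you 
would also want to cap the number of in-flight requests:

import com.datastax.driver.core.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");
PreparedStatement ps =
    session.prepare("SELECT flag FROM word WHERE user = ? AND word = ?");

List<String> words = Arrays.asList("a", "b", "aa", "ab");   // ~1000 words in practice
List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (String w : words) {
    futures.add(session.executeAsync(ps.bind("someuser", w)));
}
for (ResultSetFuture f : futures) {
    Row row = f.getUninterruptibly().one();                 // null if the word is absent
    // process(row);
}
cluster.close();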

Thanks,
Philo Yang



RE: Number of columns per row for composite columns?

2014-08-13 Thread Mohammed Guller
4


Mohammed

From: hlqv [mailto:hlqvu...@gmail.com]
Sent: Tuesday, August 12, 2014 11:44 PM
To: user@cassandra.apache.org
Subject: Re: Number of columns per row for composite columns?

For more specifically, I declared a column family

create column family Column_Family
with key_validation_class = UTF8Type
and comparator = 'CompositeType(LongType,UTF8Type)'
and default_validation_class = UTF8Type;
Will the number of columns depend on only the first component of the composite 
column name, or on both?
For example,
With row key  = 1, I have data
1 | 20140813, user1 | value1
1 | 20140813, user2 | value2
1 | 20140814, user1 | value3
1 | 20140814, user2 | value4
(1: rowkey, "20140813, user1": composite column, "value1" : the value of column)

So will the number of columns for row key 1 be 2 or 4? (2 if only the distinct 
values 20140813 and 20140814 count, or 4 if each distinct composite column counts.)
Thank you so much

On 13 August 2014 03:18, Jack Krupansky 
<j...@basetechnology.com> wrote:
Your question is a little too tangled for me... Are you asking about rows in a 
partition (some people call that a “storage row”) or columns per row? The 
latter is simply the number of columns that you have declared in your table.

The total number of columns – or more properly, “cells” – in a partition would 
be the number of rows you have inserted in that partition times the number of 
columns you have declared in the table.
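
A rough CQL3 view of the same column family may make the counting easier to see; the 
mapping below is only illustrative, based on the comparator declared earlier in the 
thread:

CREATE TABLE column_family (
  key   text,
  ts    bigint,
  user  text,
  value text,
  PRIMARY KEY (key, ts, user)
);

-- Each distinct (ts, user) pair under row key '1' is a separate cell (a separate
-- CQL row in the partition), which is why the answer at the top of this message is 4.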

If you need to review the terminology:
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

-- Jack Krupansky

From: hlqv
Sent: Tuesday, August 12, 2014 1:13 PM
To: user@cassandra.apache.org
Subject: Number of columns per row for composite columns?

Hi everyone,
I'm confused about the number of columns in a row of Cassandra. As far as I know, 
there is a limit of 2 billion columns per row. Given that, if I have a composite 
column name in each row, for example (timestamp, userid), is the number of columns 
per row the number of distinct 'timestamp' values, or is each distinct 
(timestamp, userid) pair a column?



no change observed in read latency after switching from EBS to SSD storage

2014-09-16 Thread Mohammed Guller
Hi -

We are running Cassandra 2.0.5 on AWS on m3.large instances. These instances 
were using EBS for storage (I know it is not recommended). We replaced the EBS 
storage with SSDs. However, we didn't see any change in read latency. A query 
that took 10 seconds when data was stored on EBS still takes 10 seconds even 
after we moved the data directory to SSD. It is a large query returning 200,000 
CQL rows from a single partition. We are reading 3 columns from each row and 
the combined data in these three columns for each row is around 100 bytes. In 
other words, the raw data returned by the query is approximately 20MB.

I was expecting at least 5-10 times reduction in read latency going from EBS to 
SSD, so I am puzzled why we are not seeing any change in performance.

Does anyone have insight as to why we don't see any performance impact on the 
reads going from EBS to SSD?

Thanks,
Mohammed



RE: no change observed in read latency after switching from EBS to SSD storage

2014-09-16 Thread Mohammed Guller
Rob,
The 10-second latency that I gave earlier is from CQL tracing. Almost 5 
seconds of that was taken up by the “merge memtable and sstables” step. The 
remaining 5 seconds are from “read live and tombstoned cells.”

I too first thought that maybe disk is not the bottleneck and Cassandra is 
serving everything from cache, but in that case, it should not take 10 seconds 
for reading just 20MB data.

Also, I narrowed down the query to limit it to a single partition read and I 
ran the query in cqlsh running on the same node. I turned on tracing, which 
shows that all the steps got executed on the same node. htop shows that CPU and 
memory are not the bottlenecks. Network should not come into play since the 
cqlsh is running on the same node.
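
For context, the tracing mentioned above is what cqlsh produces with its built-in 
tracing turned on, roughly like this (the actual query is shown, masked, later in this 
thread):

cqlsh> TRACING ON;
cqlsh> SELECT ... FROM dummy WHERE ... ;   -- the masked single-partition query
cqlsh> TRACING OFF;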

Is there any performance tuning parameter in the cassandra.yaml file for large 
reads?

Mohammed

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Tuesday, September 16, 2014 5:42 PM
To: user@cassandra.apache.org
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

On Tue, Sep 16, 2014 at 5:35 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Does anyone have insight as to why we don't see any performance impact on the 
reads going from EBS to SSD?

What does it say when you enable tracing on this CQL query?

10 seconds is a really long time to access anything in Cassandra. There is, 
generally speaking, a reason why the default timeouts are lower than this.

My conjecture is that the data in question was previously being served from the 
page cache and is now being served from SSD. In switching from 
EBS-plus-page-cache to SSD, you have successfully proved that SSD and RAM are both very 
fast. There is also a strong suggestion that whatever access pattern you are 
using is not bounded by disk performance.

=Rob



RE: no change observed in read latency after switching from EBS to SSD storage

2014-09-17 Thread Mohammed Guller
Thank you all for your responses.

Alex –
  Instance (ephemeral) SSD

Ben –
The query reads data from just one partition. If disk I/O is the bottleneck, 
then in theory, if reading from EBS takes 10 seconds, it should take a lot 
less time when reading the same amount of data from local SSD. My question is not 
about why it is taking 10 seconds, but why the read time is the same for both EBS 
(network-attached storage) and local SSD.

Tony –
If the data was cached in memory, then a read should not take 10 seconds just 
for 20MB of data.

Rob –
Here is the schema, query, and trace. I masked the actual column names to 
protect the innocents ☺

create table dummy(
  a   varchar,
  b   varchar,
  c   varchar,
  d   varchar,
  e   varchar,
  f   varchar,
  g   varchar,
  h   timestamp,
  i   int,
  non_key1   varchar,
  ...
  non_keyN   varchar,
  PRIMARY KEY ((a, b, c, d, e, f), g, h, i)
) WITH CLUSTERING ORDER BY (g ASC, h DESC, i ASC)

SELECT h, non_key100, non_key200 FROM dummy WHERE a='' AND b='bb' AND 
c='ccc' AND d='dd' AND e='' AND f='ff' AND g='g' AND 
h >='2014-09-10T00:00:00' AND h<='2014-09-10T23:40:41';

The above query returns around 250,000 CQL rows.

cqlsh trace:

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------
execute_cql3_query | 21:57:16,830 | 10.10.100.5 | 0
Parsing query; | 21:57:16,830 | 10.10.100.5 | 673
Preparing statement | 21:57:16,831 | 10.10.100.5 | 1602
Executing single-partition query on event | 21:57:16,845 | 10.10.100.5 | 14871
Acquiring sstable references | 21:57:16,845 | 10.10.100.5 | 14896
Merging memtable tombstones | 21:57:16,845 | 10.10.100.5 | 14954
Bloom filter allows skipping sstable 1049 | 21:57:16,845 | 10.10.100.5 | 15090
Bloom filter allows skipping sstable 989 | 21:57:16,845 | 10.10.100.5 | 15146
Partition index with 0 entries found for sstable 937 | 21:57:16,845 | 10.10.100.5 | 15565
Seeking to partition indexed section in data file | 21:57:16,845 | 10.10.100.5 | 15581
Partition index with 7158 entries found for sstable 884 | 21:57:16,898 | 10.10.100.5 | 68644
Seeking to partition indexed section in data file | 21:57:16,899 | 10.10.100.5 | 69014
Partition index with 20819 entries found for sstable 733 | 21:57:16,916 | 10.10.100.5 | 86121
Seeking to partition indexed section in data file | 21:57:16,916 | 10.10.100.5 | 86412
Skipped 1/6 non-slice-intersecting sstables, included 0 due to tombstones | 21:57:16,916 | 10.10.100.5 | 86494
Merging data from memtables and 3 sstables | 21:57:16,916 | 10.10.100.5 | 86522
Read 193311 live and 0 tombstoned cells | 21:57:24,552 | 10.10.100.5 | 7722425
Request complete | 21:57:29,074 | 10.10.100.5 | 12244832


Mohammed

From: Alex Major [mailto:al3...@gmail.com]
Sent: Wednesday, September 17, 2014 3:47 AM
To: user@cassandra.apache.org
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

When you say you moved from EBS to SSD, do you mean the EBS HDD drives to EBS 
SSD drives? Or instance SSD drives? The m3.large only comes with 32GB of 
instance based SSD storage. If you're using EBS SSD drives then network will 
still be the slowest thing so switching won't likely make much of a difference.

On Wed, Sep 17, 2014 at 6:00 AM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Rob,
The 10 seconds latency that I gave earlier is from CQL tracing. Almost 5 
seconds out of that was taken up by the “merge memtable and sstables” step. The 
remaining 5 seconds are from “read live and tombstoned cells.”

I too first thought that maybe disk is not the bottleneck and Cassandra is 
serving everything from cache, but in that case, it should not take 10 seconds 
for reading just 20MB data.

Also, I narrowed down the query to limit it to a single partition read and I 
ran the query in cqlsh running on the same node. I turned on tracing, which 
shows that all the steps got executed on the same node. htop shows that CPU and 
memory are not the bottlenecks. Network should not come into play since the 
cqlsh is running on the same node.

Is there any performance tuning parameter in the cassandra.yaml file for large 
reads?

Mohammed

From: Robert Coli [mailto:rc...@eventbrite.com<mailto:rc...@eventbrite.com>]
Sent: Tuesday, September 16, 2014 5:42 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

On Tue, Sep 16, 2014 at 5:35 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Does anyone have insight as to why we don't see any performance impact on the 
reads going from EBS

RE: no change observed in read latency after switching from EBS to SSD storage

2014-09-17 Thread Mohammed Guller
Chris,
I agree that reading 250k rows is a bit excessive and that breaking up the 
partition would help reduce the query time. That part is well understood. The 
part that we can't figure out is why the read time did not change when we switched 
from slow network-attached storage (AWS EBS) to local SSD.

One possibility is that the read is not bound by disk I/O, but it is not CPU or 
memory bound either. So where is it spending all that time? Another possibility 
is that even though it is returning only 193311 cells, C* reads the entire 
partition, which may have many more cells. But even in that case, reading from 
a local SSD should have been a lot faster than reading from non-provisioned EBS.

Mohammed

From: Chris Lohfink [mailto:clohf...@blackbirdit.com]
Sent: Wednesday, September 17, 2014 7:17 PM
To: user@cassandra.apache.org
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

"Read 193311 live and 0 tombstoned cells "

is your killer.  returning 250k rows is a bit excessive, you should really page 
this in smaller chunks, what client are you using to access the data?  This 
partition (a, b, c, d, e, f) may be too large as well (can check partition max 
size from output of nodetool cfstats), may be worth including g to break it up 
more - but I dont know enough about your data model.
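
A sketch of the check suggested here; the keyspace name is a placeholder, and the exact 
label wording in the output varies a bit between Cassandra versions:

$ nodetool cfstats my_keyspace
  # look for the compacted partition (row) maximum and mean size lines for the table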

---
Chris Lohfink

On Sep 17, 2014, at 4:53 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:


Thank you all for your responses.

Alex -
  Instance (ephemeral) SSD

Ben -
the query reads data from just one partition. If disk i/o is the bottleneck, 
then in theory, if reading from EBS takes 10 seconds, then it should take lot 
less when reading the same amount of data from local SSD. My question is not 
about why it is taking 10 seconds, but why is the read time same for both EBS 
(network attached storage) and local SSD?

Tony -
if the data was cached in memory, then a read should not take 10 seconds just 
for 20MB data

Rob -
Here is the schema, query, and trace. I masked the actual column names to 
protect the innocents :)

create table dummy(
  a   varchar,
  b   varchar,
  c   varchar,
  d   varchar,
  e   varchar,
  f   varchar,
  g   varchar,
  h   timestamp,
  i   int,
  non_key1   varchar,
  ...
  non_keyN   varchar,
  PRIMARY KEY ((a, b, c, d, e, f), g, h, i)
) WITH CLUSTERING ORDER BY (g ASC, h DESC, i ASC)

SELECT h, non_key100, non_key200 FROM dummy WHERE a='' AND b='bb' AND 
c='ccc' AND d='dd' AND e='' AND f='ff' AND g='g'AND 
h >='2014-09-10T00:00:00' AND h<='2014-09-10T23:40:41';

The above query returns around 250,000 CQL rows.

cqlsh trace:

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------
execute_cql3_query | 21:57:16,830 | 10.10.100.5 | 0
Parsing query; | 21:57:16,830 | 10.10.100.5 | 673
Preparing statement | 21:57:16,831 | 10.10.100.5 | 1602
Executing single-partition query on event | 21:57:16,845 | 10.10.100.5 | 14871
Acquiring sstable references | 21:57:16,845 | 10.10.100.5 | 14896
Merging memtable tombstones | 21:57:16,845 | 10.10.100.5 | 14954
Bloom filter allows skipping sstable 1049 | 21:57:16,845 | 10.10.100.5 | 15090
Bloom filter allows skipping sstable 989 | 21:57:16,845 | 10.10.100.5 | 15146
Partition index with 0 entries found for sstable 937 | 21:57:16,845 | 10.10.100.5 | 15565
Seeking to partition indexed section in data file | 21:57:16,845 | 10.10.100.5 | 15581
Partition index with 7158 entries found for sstable 884 | 21:57:16,898 | 10.10.100.5 | 68644
Seeking to partition indexed section in data file | 21:57:16,899 | 10.10.100.5 | 69014
Partition index with 20819 entries found for sstable 733 | 21:57:16,916 | 10.10.100.5 | 86121
Seeking to partition indexed section in data file | 21:57:16,916 | 10.10.100.5 | 86412
Skipped 1/6 non-slice-intersecting sstables, included 0 due to tombstones | 21:57:16,916 | 10.10.100.5 | 86494
Merging data from memtables and 3 sstables | 21:57:16,916 | 10.10.100.5 | 86522
Read 193311 live and 0 tombstoned cells | 21:57:24,552 | 10.10.100.5 | 7722425
Request complete | 21:57:29,074 | 10.10.100.5 | 12244832


Mohammed

From: Alex Major [mailto:al3...@gmail.com]
Sent: Wednesday, September 17, 2014 3:47 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

When you say you moved from EBS to SSD, do you mean the EBS HDD drives to EBS 
SSD drives? Or instance SSD drives? The m3.large only comes with 32GB of 
instance based SSD storage. If you're using EBS SSD drives then network will 
s

RE: no change observed in read latency after switching from EBS to SSD storage

2014-09-18 Thread Mohammed Guller
Benedict,
That makes perfect sense. Even though the node has multiple cores, I do see 
that only one core is pegged at 100%.

Interestingly, after I switched to 2.1, cqlsh trace now shows that the same 
query takes only 600ms. However, cqlsh still waits for almost 20-30 seconds 
before it starts showing the result. I noticed similar latency when I ran the 
query from our app, which uses the Astyanax driver. At first I thought perhaps there 
is a bug in the cqlsh code that tracks the statistics and the reported numbers 
are incorrect. But I guess the numbers shown by cqlsh trace are correct, and 
the bottleneck is somewhere else now. In other words, the read operation itself 
is much faster in 2.1, but something else delays the response back to the 
client.

Mohammed

From: Benedict Elliott Smith [mailto:belliottsm...@datastax.com]
Sent: Thursday, September 18, 2014 2:15 AM
To: user@cassandra.apache.org
Cc: Chris Lohfink
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

It is possible this is CPU bound. In 2.1 we have optimised the comparison of 
clustering columns 
(CASSANDRA-5417<https://issues.apache.org/jira/browse/CASSANDRA-5417>), but in 
2.0 it is quite expensive. So for a large row with several million comparisons 
to perform (to merge, filter, etc.) it could be a significant proportion of the 
cost. Note that these costs for a given query are all bound by a single core, 
there is no parallelism, since the assumption is we are serving more queries at 
once than there are cores (in general Cassandra is not designed to serve 
workloads consisting of single large queries, at least not yet)

On Thu, Sep 18, 2014 at 7:29 AM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Chris,
I agree that reading 250k row is a bit excessive and that breaking up the 
partition would help reduce the query time. That part is well understood. The 
part that we can’t figure out is why read time did not change when we switched 
from a slow Network Attached Storage (AWS EBS) to local SSD.

One possibility is that the read is not bound by disk i/o, but it is not cpu or 
memory bound either. So where is it spending all that time? Another possibility 
is that even though it is returning only 193311 cells, C* reads the entire 
partition, which may have a lot more cells. But even in that case reading from 
a local SSD should have been a lot faster than reading from non-provisioned EBS.

Mohammed

From: Chris Lohfink 
[mailto:clohf...@blackbirdit.com<mailto:clohf...@blackbirdit.com>]
Sent: Wednesday, September 17, 2014 7:17 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: no change observed in read latency after switching from EBS to SSD 
storage

"Read 193311 live and 0 tombstoned cells "

is your killer.  returning 250k rows is a bit excessive, you should really page 
this in smaller chunks, what client are you using to access the data?  This 
partition (a, b, c, d, e, f) may be too large as well (can check partition max 
size from output of nodetool cfstats), may be worth including g to break it up 
more - but I dont know enough about your data model.

---
Chris Lohfink

On Sep 17, 2014, at 4:53 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:

Thank you all for your responses.

Alex –
  Instance (ephemeral) SSD

Ben –
the query reads data from just one partition. If disk i/o is the bottleneck, 
then in theory, if reading from EBS takes 10 seconds, then it should take lot 
less when reading the same amount of data from local SSD. My question is not 
about why it is taking 10 seconds, but why is the read time same for both EBS 
(network attached storage) and local SSD?

Tony –
if the data was cached in memory, then a read should not take 10 seconds just 
for 20MB data

Rob –
Here is the schema, query, and trace. I masked the actual column names to 
protect the innocents ☺

create table dummy(
  a   varchar,
  b   varchar,
  c   varchar,
  d   varchar,
  e   varchar,
  f   varchar,
  g   varchar,
  h   timestamp,
  i   int,
  non_key1   varchar,
  ...
  non_keyN   varchar,
  PRIMARY KEY ((a, b, c, d, e, f), g, h, i)
) WITH CLUSTERING ORDER BY (g ASC, h DESC, i ASC)

SELECT h, non_key100, non_key200 FROM dummy WHERE a='' AND b='bb' AND 
c='ccc' AND d='dd' AND e='' AND f='ff' AND g='g'AND 
h >='2014-09-10T00:00:00' AND h<='2014-09-10T23:40:41';

The above query returns around 250,000 CQL rows.

cqlsh trace:

activity | timestamp| source  | source_elapsed
-
execute_cql3_query | 21:57:16,830 | 10.10.100.5 |  0
Parsing query; | 21:57:16,830 | 10.10.100.5 |673
Preparing statement | 21:57:16,831 | 10.10.100.5 |   1602
Executing single-partition query on event | 

RE: What will be system configuration for retrieving few "GB" of data

2014-10-17 Thread Mohammed Guller
With 8GB RAM, the default heap size is 2GB, so you will quickly start running 
out of heap space if you do large reads. What is a large read? It depends on 
the number of columns in each row and the data in each column. It could be 100,000 
rows for some and 300,000 for others. In addition, remember that Java adds a 
lot of overhead to data in memory, so an 8-character string will not occupy just 
8 bytes in memory, but a lot more.

In general, avoid large reads in C*. If it is absolutely a must and you cannot 
repartition the data, then use a driver that supports paging.

Mohammed

From: Umang Shah [mailto:shahuma...@gmail.com]
Sent: Wednesday, October 15, 2014 10:46 PM
To: user@cassandra.apache.org
Subject: What will be system configuration for retrieving few "GB" of data

Hi,

I am facing many problems after storing a certain number of records in Cassandra; 
it is giving an OutOfMemoryError.

I have 8GB of RAM in my system, so how many records can I expect to retrieve with 
a select query?

And what would the configuration be for those people who are retrieving 15-20 GB 
of data?

Can somebody explain to me how to improve read performance? That would be a great 
help. I tried
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_tune_jvm_c.html

and similar things, but with no luck.


--
Regards,
Umang Shah
shahuma...@gmail.com


querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-19 Thread Mohammed Guller
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with 
Cassandra. It would be great if you could share how you got it working. For 
example, what config changes have to be done in hive-site.xml, what additional 
jars are required, etc.?

I have a Spark app that can programmatically query data from Cassandra using 
Spark SQL and Spark-Cassandra-Connector. No problem there, but I couldn't find 
any documentation for using the Thrift JDBC server for querying data from 
Cassandra.

Thanks,
Mohammed



batch_size_warn_threshold_in_kb

2014-12-11 Thread Mohammed Guller
Hi -
The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb.
The default value is 5kb and, according to the comments in the yaml file, it is 
used to log WARN on any batch size exceeding this value in kilobytes. It says 
caution should be taken on increasing the size of this threshold as it can lead 
to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher 
number (say 10kb) lead to node instability?
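
For reference, a sketch of the relevant cassandra.yaml entry, with the comment 
paraphrased from the description above:

# Log WARN on any batch size exceeding this value in kilobytes. Caution should
# be taken on increasing the size of this threshold as it can lead to node
# instability.
batch_size_warn_threshold_in_kb: 5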

Mohammed


RE: batch_size_warn_threshold_in_kb

2014-12-11 Thread Mohammed Guller
Ryan,
Thanks for the quick response.

I did see that jira before posting my question on this list. However, I didn’t 
see any information about why 5kb+ data will cause instability. 5kb or even 
50kb seems too small. For example, if each mutation is 1000+ bytes, then with 
just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 
mutations per batch. So why not warn users just on the # of mutations in a 
batch?

Mohammed

From: Ryan Svihla [mailto:rsvi...@datastax.com]
Sent: Thursday, December 11, 2014 12:56 PM
To: user@cassandra.apache.org
Subject: Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story 
behind the original recommendation here

https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than 
~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 
byte mutations.

Totally up for debate."

It's totally changeable, however, it's there in no small part because so many 
people confuse the BATCH keyword as a performance optimization, this helps flag 
those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Hi –
The cassandra.yaml file has property called batch_size_warn_threshold_in_kb.
The default size is 5kb and according to the comments in the yaml file, it is 
used to log WARN on any batch size exceeding this value in kilobytes. It says 
caution should be taken on increasing the size of this threshold as it can lead 
to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher 
number (say 10kb) lead to node instability?

Mohammed


--

DataStax <http://www.datastax.com/>

Ryan Svihla

Solution Architect

Twitter: <https://twitter.com/foundev> | LinkedIn: <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>


DataStax is the fastest, most scalable distributed database technology, 
delivering Apache Cassandra to the world’s most innovative enterprises. 
Datastax is built to be agile, always-on, and predictably scalable to any size. 
With more than 500 customers in 45 countries, DataStax is the database 
technology and transactional backbone of choice for the worlds most innovative 
companies such as Netflix, Adobe, Intuit, and eBay.



C* throws OOM error despite use of automatic paging

2015-01-08 Thread Mohammed Guller
Hi -

We have an ETL application that reads all rows from Cassandra (2.1.2), filters 
them and stores a small subset in an RDBMS. Our application is using Datastax's 
Java driver (2.1.4) to fetch data from the C* nodes. Since the Java driver 
supports automatic paging, I was under the impression that SELECT queries 
should not cause an OOM error on the C* nodes. However, even with just 16GB 
data on each nodes, the C* nodes start throwing OOM error as soon as the 
application starts iterating through the rows of a table.

The application code looks something like this:

Statement stmt = new SimpleStatement("SELECT x,y,z FROM cf").setFetchSize(5000);
ResultSet rs = session.execute(stmt);
while (!rs.isExhausted()) {
  Row row = rs.one();
  process(row);
}
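
For comparison, iterating the ResultSet directly relies on the same transparent paging 
in driver 2.1 (a sketch; it would not by itself fix the OOM described here):

for (Row row : rs) {   // the driver fetches the next page as the iterator advances
    process(row);
}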

Even after we reduced the page size to 1000, the C* nodes still crash. C* is 
running on M3.xlarge machines (4 cores, 15GB). We manually increased the heap 
size to 8GB just to see how much heap C* consumes. Within 10-15 minutes, the heap 
usage climbs up to 7.6GB. That does not make sense. Either automatic paging is 
not working or we are missing something.

Does anybody have insights as to what could be happening? Thanks.

Mohammed




RE: C* throws OOM error despite use of automatic paging

2015-01-10 Thread Mohammed Guller
Hi Jens,
Thank you for sharing the results of your tests.

I even tried setFetchSize with 100 and it didn't help much. I am coming to the 
conclusion that the correct number for setFetchSize depends on the data. In 
some cases the default is fine, whereas in others it needs to be significantly 
lower than 5000. As you mentioned, that leaves a lot of operational risk for 
production use.

It would be great if there were some documented guidelines on how to select the 
correct number for setFetchSize.
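
For what it's worth, a minimal sketch of setting a cluster-wide default fetch size with 
the Java driver instead of per statement; the contact point is a placeholder, and 500 
simply echoes the limit Jens mentions below:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;

Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withQueryOptions(new QueryOptions().setFetchSize(500))
    .build();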

Mohammed


-Original Message-
From: Jens-U. Mozdzen [mailto:jmozd...@nde.ag] 
Sent: Friday, January 9, 2015 4:02 AM
To: user@cassandra.apache.org
Subject: Re: C* throws OOM error despite use of automatic paging

Hi Mohammed,

Zitat von Mohammed Guller :
> Hi -
>
> We have an ETL application that reads all rows from Cassandra (2.1.2), 
> filters them and stores a small subset in an RDBMS. Our application is 
> using Datastax's Java driver (2.1.4) to fetch data from the C* nodes. 
> Since the Java driver supports automatic paging, I was under the 
> impression that SELECT queries should not cause an OOM error on the C* 
> nodes. However, even with just 16GB data on each nodes, the C* nodes 
> start throwing OOM error as soon as the application starts iterating 
> through the rows of a table.
>
> The application code looks something like this:
>
> Statement stmt = new SimpleStatement("SELECT x,y,z FROM 
> cf").setFetchSize(5000); ResultSet rs = session.execute(stmt); while 
> (!rs.isExhausted()){
>   row = rs.one()
>   process(row)
> }
>
> Even after we reduced the page size to 1000, the C* nodes still crash. 
> C* is running on M3.xlarge machines (4-cores, 15GB).

I've been running a few tests to determine the effect of setFetchSize() on heap 
pressure on the Cassandra nodes and came to the conclusion that a limit of "500" 
is much more helpful than values above "1000"... with values that were too high, we 
managed to put so much pressure on the nodes that we had to restart them.

This, btw, leaves a lot of operational risk for production use. I have, for example, 
found no way to influence time-outs or fetch size with the Datastax JDBC driver, with 
corresponding consequences for the queries (time-outs) and C* node behavior (especially 
heap pressure). Hence, operating a C* cluster requires a lot of trust in the skills of 
the "users" (developers/maintainers of the client-side solutions) and their tools :( .

Regards,
Jens






RE: C* throws OOM error despite use of automatic paging

2015-01-10 Thread Mohammed Guller
nodetool cfstats shows 9GB. We are storing simple primitive values. No blobs or 
collections.

Mohammed

From: DuyHai Doan [mailto:doanduy...@gmail.com]
Sent: Friday, January 9, 2015 12:51 AM
To: user@cassandra.apache.org
Subject: Re: C* throws OOM error despite use of automatic paging

What is the data size of the column family you're trying to fetch with paging? 
Are you storing big blobs or just primitive values?

On Fri, Jan 9, 2015 at 8:33 AM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Hi –

We have an ETL application that reads all rows from Cassandra (2.1.2), filters 
them and stores a small subset in an RDBMS. Our application is using Datastax’s 
Java driver (2.1.4) to fetch data from the C* nodes. Since the Java driver 
supports automatic paging, I was under the impression that SELECT queries 
should not cause an OOM error on the C* nodes. However, even with just 16GB 
data on each nodes, the C* nodes start throwing OOM error as soon as the 
application starts iterating through the rows of a table.

The application code looks something like this:

Statement stmt = new SimpleStatement("SELECT x,y,z FROM cf").setFetchSize(5000);
ResultSet rs = session.execute(stmt);
while (!rs.isExhausted()){
  row = rs.one()
  process(row)
}

Even after we reduced the page size to 1000, the C* nodes still crash. C* is 
running on M3.xlarge machines (4-cores, 15GB). We manually increased the heap 
size to 8GB just to see how much heap C* consumes. With 10-15 minutes, the heap 
usage climbs up to 7.6GB. That does not make sense. Either automatic paging is 
not working or we are missing something.

Does anybody have insights as to what could be happening? Thanks.

Mohammed





RE: C* throws OOM error despite use of automatic paging

2015-01-12 Thread Mohammed Guller
The heap usage is pretty low (less than 700MB) when the application starts. I 
can see the heap usage gradually climbing once the application is running. C* does 
not log any errors before the OOM happens.

Data is on EBS. Write throughput is quite high with two applications 
simultaneously pumping data into C*.


Mohammed

From: Ryan Svihla [mailto:r...@foundev.pro]
Sent: Monday, January 12, 2015 3:39 PM
To: user
Subject: Re: C* throws OOM error despite use of automatic paging

I think it's more accurate to say that auto paging prevents one type of 
OOM. It's premature to diagnose it as 'not happening'.

What is heap usage when you start? Are you storing your data on EBS? What kind 
of write throughput do you have going on at the same time? What errors do you 
have in the cassandra logs before this crashes?


On Sat, Jan 10, 2015 at 1:48 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
nodetool cfstats shows 9GB. We are storing simple primitive value. No blobs or 
collections.

Mohammed

From: DuyHai Doan [mailto:doanduy...@gmail.com<mailto:doanduy...@gmail.com>]
Sent: Friday, January 9, 2015 12:51 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: C* throws OOM error despite use of automatic paging

What is the data size of the column family you're trying to fetch with paging ? 
Are you storing big blob or just primitive values ?

On Fri, Jan 9, 2015 at 8:33 AM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Hi –

We have an ETL application that reads all rows from Cassandra (2.1.2), filters 
them and stores a small subset in an RDBMS. Our application is using Datastax’s 
Java driver (2.1.4) to fetch data from the C* nodes. Since the Java driver 
supports automatic paging, I was under the impression that SELECT queries 
should not cause an OOM error on the C* nodes. However, even with just 16GB 
data on each nodes, the C* nodes start throwing OOM error as soon as the 
application starts iterating through the rows of a table.

The application code looks something like this:

Statement stmt = new SimpleStatement("SELECT x,y,z FROM cf").setFetchSize(5000);
ResultSet rs = session.execute(stmt);
while (!rs.isExhausted()){
  row = rs.one()
  process(row)
}

Even after we reduced the page size to 1000, the C* nodes still crash. C* is 
running on M3.xlarge machines (4-cores, 15GB). We manually increased the heap 
size to 8GB just to see how much heap C* consumes. With 10-15 minutes, the heap 
usage climbs up to 7.6GB. That does not make sense. Either automatic paging is 
not working or we are missing something.

Does anybody have insights as to what could be happening? Thanks.

Mohammed






--

Thanks,
Ryan Svihla


Re: C* throws OOM error despite use of automatic paging

2015-01-12 Thread Mohammed Guller
There are no tombstones.

Mohammed


On Jan 12, 2015, at 9:11 PM, Dominic Letz 
mailto:dominicl...@exosite.com>> wrote:

Does your use case include many tombstones? If yes then that might explain the 
OOM situation.

If you want to know for sure, you can enable heap dump generation on crash in 
cassandra-env.sh: just uncomment JVM_OPTS="$JVM_OPTS 
-XX:+HeapDumpOnOutOfMemoryError" and then run your query again. The heap dump 
will have the answer.
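
(You can also point the dump at a disk with enough free space, for example via 
-XX:HeapDumpPath=/some/dir, or the CASSANDRA_HEAPDUMP_DIR hook in 
cassandra-env.sh if your version has it, and then open the dump with a tool 
such as Eclipse MAT.)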




On Tue, Jan 13, 2015 at 10:54 AM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
The heap usage is pretty low ( less than 700MB) when the application starts. I 
can see the heap usage gradually climbing once the application starts. C* does 
not log any errors before OOM happens.

Data is on EBS. Write throughput is quite high with two applications 
simultaneously pumping data into C*.


Mohammed

From: Ryan Svihla [mailto:r...@foundev.pro<mailto:r...@foundev.pro>]
Sent: Monday, January 12, 2015 3:39 PM
To: user

Subject: Re: C* throws OOM error despite use of automatic paging

I think it's more accurate that to say that auto paging prevents one type of 
OOM. It's premature to diagnose it as 'not happening'.

What is heap usage when you start? Are you storing your data on EBS? What kind 
of write throughput do you have going on at the same time? What errors do you 
have in the cassandra logs before this crashes?


On Sat, Jan 10, 2015 at 1:48 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
nodetool cfstats shows 9GB. We are storing simple primitive value. No blobs or 
collections.

Mohammed

From: DuyHai Doan [mailto:doanduy...@gmail.com<mailto:doanduy...@gmail.com>]
Sent: Friday, January 9, 2015 12:51 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: C* throws OOM error despite use of automatic paging

What is the data size of the column family you're trying to fetch with paging ? 
Are you storing big blob or just primitive values ?

On Fri, Jan 9, 2015 at 8:33 AM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Hi –

We have an ETL application that reads all rows from Cassandra (2.1.2), filters 
them and stores a small subset in an RDBMS. Our application is using Datastax’s 
Java driver (2.1.4) to fetch data from the C* nodes. Since the Java driver 
supports automatic paging, I was under the impression that SELECT queries 
should not cause an OOM error on the C* nodes. However, even with just 16GB 
data on each nodes, the C* nodes start throwing OOM error as soon as the 
application starts iterating through the rows of a table.

The application code looks something like this:

Statement stmt = new SimpleStatement("SELECT x,y,z FROM cf").setFetchSize(5000);
ResultSet rs = session.execute(stmt);
while (!rs.isExhausted()){
  row = rs.one()
  process(row)
}

Even after we reduced the page size to 1000, the C* nodes still crash. C* is 
running on M3.xlarge machines (4-cores, 15GB). We manually increased the heap 
size to 8GB just to see how much heap C* consumes. With 10-15 minutes, the heap 
usage climbs up to 7.6GB. That does not make sense. Either automatic paging is 
not working or we are missing something.

Does anybody have insights as to what could be happening? Thanks.

Mohammed






--

Thanks,
Ryan Svihla



--
Dominic Letz
Director of R&D
Exosite<http://exosite.com>



RE: Retrieving all row keys of a CF

2015-01-16 Thread Mohammed Guller
Ruchir,
I am curious if you had better luck with the AllRowsReader recipe.

Mohammed

From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Friday, January 16, 2015 12:33 PM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

Note that getAllRows() is deprecated in Astyanax (see 
here).

You should prefer to use the AllRowsReader recipe: 
https://github.com/Netflix/astyanax/wiki/AllRowsReader-All-rows-query

Note the section titled "Reading only the row keys", which seems to match your 
use case exactly. You should start getting row keys back very, very quickly.

On Fri, Jan 16, 2015 at 11:32 AM, Ruchir Jha 
mailto:ruchir@gmail.com>> wrote:
We have a column family that has about 800K rows and on average about a 
million columns each. I am interested in getting all the row keys in this column 
family and I am using the following Astyanax code snippet to do this.
This query never finishes (ran it for 2 days but did not finish).

This query, however, works with CFs that have a smaller number of columns. This 
leads me to believe that there might be an API that just retrieves the row keys 
and does not depend on the number of columns in the CF. Any suggestions are 
appreciated.

I am running Cassandra 2.0.9 and this is a 4 node cluster.


keyspace.prepareQuery(this.wideRowTables.get(group))
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
        .getAllRows()
        .setRowLimit(1000)
        .setRepeatLastToken(false)
        .withColumnRange(new RangeBuilder().setLimit(1).build())
        .executeWithCallback(new RowCallback() {

            @Override
            public boolean failure(ConnectionException e) {
                return true;
            }

            @Override
            public void success(Rows rows) {
                // iterating over rows here
            }
        });



RE: Retrieving all row keys of a CF

2015-01-16 Thread Mohammed Guller
A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

Hi,
Ruchir and I tried the query using the AllRowsReader recipe but had no luck. We are 
seeing PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
   at 
com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:84)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:117)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:338)
   at 
com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$2.execute(ThriftColumnFamilyQueryImpl.java:397)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:447)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:419)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

We did receive a portion of the data, which changes on every try. We used the 
following method.
boolean result = new AllRowsReader.Builder<String, String>(keyspace, CF_STANDARD1)
        .withColumnRange(null, null, false, 0)
        .withPartitioner(null) // this will use keyspace's partitioner
        .forEachRow(new Function<Row<String, String>, Boolean>() {
            @Override
            public Boolean apply(@Nullable Row<String, String> row) {
                // Process the row here ...
                return true;
            }
        })
        .build()
        .call();

Tried setting concurrency level as mentioned in this post 
(https://github.com/Netflix/astyanax/issues/411) as well on both astyanax 
1.56.49 and 2.0.0. Still nothing.


RE: Retrieving all row keys of a CF

2015-01-16 Thread Mohammed Guller
Total system memory and heap size can't both be 8GB, can they?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experiment, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.
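
(In cassandra.yaml those would be range_request_timeout_in_ms: 45000 and 
read_request_timeout_in_ms: 45000; on the Astyanax side, something like 
setSocketTimeout(50000) on the ConnectionPoolConfigurationImpl, if I remember 
the API correctly.)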

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 5:11 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF


1)What is the heap size and total memory on each node? 8GB, 8GB
2)How big is the cluster? 4
3)What are the read and range timeouts (in cassandra.yaml) on the 
C* nodes? 10 secs, 10 secs
4)What are the timeouts for the Astyanax client? 2 secs
5)Do you see GC pressure on the C* nodes? How long does GC for new 
gen and old gen take? occurs every 5 secs dont see huge gc pressure, <50ms
6)Does any node crash with OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Retrieving all row keys of a CF

Hi,
I and Ruchir tried query using AllRowsReader recipe but had no luck. We are 
seeing PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
   at 
com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:84)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:117)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:338)
   at 
com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$2.execute(ThriftColumnFamilyQueryImpl.java:397)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:447)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:419)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

We did receive a portion of data which changes on every try. We used following 
method.
boolean result = new AllRowsReader.Builder(keyspace, 
CF_STANDARD1)
.withColumnRange(null, null, false, 0)
.withPartitioner(null) // this will use keyspace's partitioner
.forEachRow(new Function, Boolean>() {
@Override
public Boolean apply(@Nullable Row row) {
// Process the row here ...
return true;
}
})
.build()
.call();

Tried setting concurrency level as mentioned in this post 
(https://github.com/Netflix/astyanax/issues/411) as well on both astyanax 
1.56.49 and 2.0.0. Still nothing.


RE: sharding vs what cassandra does

2015-01-19 Thread Mohammed Guller
Partitioning is similar to sharding.

Mohammed

From: Adaryl "Bob" Wakefield, MBA [mailto:adaryl.wakefi...@hotmail.com]
Sent: Monday, January 19, 2015 8:28 PM
To: user@cassandra.apache.org
Subject: sharding vs what cassandra does

It’s my understanding that the way Cassandra replicates data across nodes is 
NOT sharding. Can someone provide a better explanation or correct my 
understanding?
B.


RE: Retrieving all row keys of a CF

2015-01-22 Thread Mohammed Guller
What is the average and max # of CQL rows in each partition? Is 800,000 the 
number of CQL rows or Cassandra partitions (storage engine rows)?

Another option you could try is a CQL statement to fetch all partition keys. 
You could first try this in the cqlsh:

“SELECT DISTINCT pk1, pk2…pkn FROM CF”

You will need to specify all the composite columns if you are using a composite 
partition key.
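
If the cqlsh query works, the same statement can be paged through 
programmatically; a rough sketch with the DataStax Java driver, assuming an 
open Session named session (column and table names are placeholders):

Statement stmt = new SimpleStatement("SELECT DISTINCT pk1, pk2 FROM ks.cf");
stmt.setFetchSize(1000);                 // page through the partition keys
for (Row row : session.execute(stmt)) {
    // each row holds one partition key (pk1, pk2)
}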

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 22, 2015 1:57 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

Hi,
I increased the range and read timeouts first to 50 secs and then to 500 secs, 
and the Astyanax client timeouts to 60 and 550 secs respectively. I still get a 
timeout exception.
I see the logic with the .withCheckpointManager() code; is that the only way it 
could work?


From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Saturday, January 17, 2015 9:55 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Retrieving all row keys of a CF

If you're getting partial data back, then failing eventually, try setting 
.withCheckpointManager() - this will let you keep track of the token ranges 
you've successfully processed, and not attempt to reprocess them.  This will 
also let you set up tasks on bigger data sets that take hours or days to run, 
and reasonably safely interrupt it at any time without losing progress.

This is some *very* old code, but I dug this out of a git history.  We don't 
use Astyanax any longer, but maybe an example implementation will help you.  
This is Scala instead of Java, but hopefully you can get the gist.

https://gist.github.com/MightyE/83a79b74f3a69cfa3c4e

If you're timing out talking to your cluster, then I don't recommend using the 
cluster to track your checkpoints, but some other data store (maybe just a 
flatfile).  Again, this is just to give you a sense of what's involved.

On Fri, Jan 16, 2015 at 6:31 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Both total system memory and heap size can’t be 8GB?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experimentation, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.

Mohammed

From: Ravi Agrawal 
[mailto:ragra...@clearpoolgroup.com<mailto:ragra...@clearpoolgroup.com>]
Sent: Friday, January 16, 2015 5:11 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF


1)What is the heap size and total memory on each node? 8GB, 8GB
2)How big is the cluster? 4
3)What are the read and range timeouts (in cassandra.yaml) on the 
C* nodes? 10 secs, 10 secs
4)What are the timeouts for the Astyanax client? 2 secs
5)Do you see GC pressure on the C* nodes? How long does GC for new 
gen and old gen take? occurs every 5 secs dont see huge gc pressure, <50ms
6)Does any node crash with OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Retrieving all row keys of a CF

Hi,
I and Ruchir tried query using AllRowsReader recipe but had no luck. We are 
seeing PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
   at 
com.netflix.astyanax.connectionpool.impl.RoundRobinExecu

RE: Retrieving all row keys of a CF

2015-01-23 Thread Mohammed Guller
No wonder the client is timing out. Even though C* supports up to 2B columns per 
partition, it is recommended not to have more than 100k CQL rows in a partition.

It has been a long time since I used Astyanax, so I don’t remember whether the 
AllRowsReader reads all CQL rows or storage rows. If it is reading all CQL 
rows, then essentially it is trying to read 800k*200k rows. That will be 160B 
rows!

Did you try “SELECT DISTINCT …” from cqlsh?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 22, 2015 11:12 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

In each partition, the number of CQL rows on average is 200K. Max is 3M.
800K is the number of Cassandra partitions.


From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Thursday, January 22, 2015 7:43 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF

What is the average and max # of CQL rows in each partition? Is 800,000 the 
number of CQL rows or Cassandra partitions (storage engine rows)?

Another option you could try is a CQL statement to fetch all partition keys. 
You could first try this in the cqlsh:

“SELECT DISTINCT pk1, pk2…pkn FROM CF”

You will need to specify all the composite columns if you are using a composite 
partition key.

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 22, 2015 1:57 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF

Hi,
I increased range timeout, read timeout to first to 50 secs then 500 secs and 
Astyanax client to 60, 550 secs respectively. I still get timeout exception.
I see the logic with .withCheckpointManager() code, is that the only way it 
could work?


From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Saturday, January 17, 2015 9:55 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Retrieving all row keys of a CF

If you're getting partial data back, then failing eventually, try setting 
.withCheckpointManager() - this will let you keep track of the token ranges 
you've successfully processed, and not attempt to reprocess them.  This will 
also let you set up tasks on bigger data sets that take hours or days to run, 
and reasonably safely interrupt it at any time without losing progress.

This is some *very* old code, but I dug this out of a git history.  We don't 
use Astyanax any longer, but maybe an example implementation will help you.  
This is Scala instead of Java, but hopefully you can get the gist.

https://gist.github.com/MightyE/83a79b74f3a69cfa3c4e

If you're timing out talking to your cluster, then I don't recommend using the 
cluster to track your checkpoints, but some other data store (maybe just a 
flatfile).  Again, this is just to give you a sense of what's involved.

On Fri, Jan 16, 2015 at 6:31 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Both total system memory and heap size can’t be 8GB?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experimentation, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.

Mohammed

From: Ravi Agrawal 
[mailto:ragra...@clearpoolgroup.com<mailto:ragra...@clearpoolgroup.com>]
Sent: Friday, January 16, 2015 5:11 PM

To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF


1)What is the heap size and total memory on each node? 8GB, 8GB
2)How big is the cluster? 4
3)What are the read and range timeouts (in cassandra.yaml) on the 
C* nodes? 10 secs, 10 secs
4)What are the timeouts for the Astyanax client? 2 secs
5)Do you see GC pressure on the C* nodes? How long does GC for new 
gen and old gen take? occurs every 5 secs dont see huge gc pressure, <50ms
6)Does any node crash with OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new 

RE: Controlling the MAX SIZE of sstables after compaction

2015-01-27 Thread Mohammed Guller
I believe Aegisthus is open sourced.

Mohammed

From: Jan [mailto:cne...@yahoo.com]
Sent: Monday, January 26, 2015 11:20 AM
To: user@cassandra.apache.org
Subject: Re: Controlling the MAX SIZE of sstables after compaction

Parth  et al;

the folks at Netflix seem to have built a solution for your problem.
The Netflix Tech Blog: Aegisthus - A Bulk Data Pipeline out of 
Cassandra





(techblog.netflix.com, by Charles Smith and Jeff Magnusson)

You may want to chase Jeff Magnusson and check if the solution is open sourced.
Please report back to this forum if you get an answer to the problem.

hope this helps.
Jan

C* Architect

On Monday, January 26, 2015 11:25 AM, Robert Coli 
mailto:rc...@eventbrite.com>> wrote:

On Sun, Jan 25, 2015 at 10:40 PM, Parth Setya 
mailto:setya.pa...@gmail.com>> wrote:
1. Is there a way to configure the size of sstables created after compaction?

No, won't fix: https://issues.apache.org/jira/browse/CASSANDRA-4897.

You could use the "sstablesplit" utility on your One Big SSTable to split it 
into files of your preferred size.

2. Is there a better approach to generate the report?

The major compaction isn't too bad, but something that understands SSTables as 
an input format would be preferable to sstable2json.

3. What are the flaws with this approach?

sstable2json is slow and transforms your data to JSON.

=Rob



full-table scan - extracting all data from C*

2015-01-27 Thread Mohammed Guller
Hi -

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don't have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don't think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn't really 
work if you have a large amount of data in C*.

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL (see the sketch below).

2)  Extract the data directly from SSTables files.

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.
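
For approach #1, the usual trick is to split the full token range into chunks 
and issue one token-range query per chunk. A rough sketch with the Java driver, 
assuming Murmur3Partitioner and a table ks.mytable with partition key pk (all 
names are illustrative, not a definitive implementation):

import java.math.BigInteger;
import com.datastax.driver.core.*;

public class FullTableScan {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        PreparedStatement ps = session.prepare(
            "SELECT pk, col1 FROM ks.mytable WHERE token(pk) > ? AND token(pk) <= ?");
        int splits = 256;                              // number of token sub-ranges
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger width = BigInteger.valueOf(Long.MAX_VALUE)
                .subtract(min).divide(BigInteger.valueOf(splits));
        for (int i = 0; i < splits; i++) {
            long start = min.add(width.multiply(BigInteger.valueOf(i))).longValue();
            long end = (i == splits - 1)
                    ? Long.MAX_VALUE
                    : min.add(width.multiply(BigInteger.valueOf(i + 1))).longValue();
            // each sub-range is paged through with the configured fetch size
            ResultSet rs = session.execute(ps.bind(start, end).setFetchSize(1000));
            for (Row row : rs) {
                // handle the row here
            }
        }
        cluster.close();
    }
}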

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

Mohammed



RE: Re: full-table scan - extracting all data from C*

2015-01-27 Thread Mohammed Guller
How big is your table? How much data does it have?

Mohammed

From: Xu Zhongxing [mailto:xu_zhong_x...@163.com]
Sent: Tuesday, January 27, 2015 5:34 PM
To: user@cassandra.apache.org
Subject: Re: full-table scan - extracting all data from C*

Both Java driver "select * from table" and Spark sc.cassandraTable() work well.
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller" 
mailto:moham...@glassbeam.com>> wrote:

Hi -

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don't have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don't think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn't really 
work if you have a large amount of data in C*.

I am aware of couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

Mohammed



RE: Tombstone gc after gc grace seconds

2015-01-29 Thread Mohammed Guller
Ravi –

It may help.

What version are you running? Do you know if minor compaction is getting 
triggered at all? One way to check would be see how many sstables the data 
directory has.

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 29, 2015 1:29 PM
To: user@cassandra.apache.org
Subject: RE: Tombstone gc after gc grace seconds

Hi,
I saw there are 2 more interesting parameters –

a.   tombstone_threshold - A ratio of garbage-collectable tombstones to all 
contained columns, which if exceeded by the SSTable triggers compaction (with 
no other SSTables) for the purpose of purging the tombstones. Default value – 
0.2

b.  unchecked_tombstone_compaction - True enables more aggressive than 
normal tombstone compactions. A single SSTable tombstone compaction runs 
without checking the likelihood of success. Cassandra 2.0.9 and later.
Could I use these to get what I want?
The problem I am encountering is that even long after gc_grace_seconds, I see no 
reduction in disk space until I run compaction manually. I was thinking of 
setting tombstone_threshold close to 0 and unchecked_tombstone_compaction to true.
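To be concrete, something like this is what I had in mind (keyspace/table and 
compaction class are just examples): ALTER TABLE mykeyspace.mytable WITH 
compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': 
'0.05', 'unchecked_tombstone_compaction': 'true'};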
Also we are not running nodetool repair on weekly basis as of now.

From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Monday, January 26, 2015 12:11 PM
To: user@cassandra.apache.org
Subject: Re: Tombstone gc after gc grace seconds

My understanding is consistent with Alain's, there's no way to force a 
tombstone-only compaction, your only option is major compaction.  If you're 
using size tiered, that comes with its own drawbacks.

I wonder if there's a technical limitation that prevents introducing a shadowed 
data cleanup style operation (overwritten data, including deletes, plus 
tombstones past their gc grace period); or maybe even couple it directly with 
cleanup since most of the work (rewriting old SSTables) would be identical.  I 
can't think of something off the top of my head, but it would be so useful that 
it seems like there's got to be something I'm missing.

On Mon, Jan 26, 2015 at 4:15 AM, Alain RODRIGUEZ 
mailto:arodr...@gmail.com>> wrote:
I don't think that such a thing exists as SSTables are immutable. You compact 
it entirely or you don't. Minor compaction will eventually evict tombstones. If 
it is too slow, AFAIK, the "better" solution is a major compaction.

C*heers,

Alain

2015-01-23 0:00 GMT+01:00 Ravi Agrawal 
mailto:ragra...@clearpoolgroup.com>>:
Hi,
I want to trigger just tombstone compaction after gc grace seconds is completed 
not nodetool compact keyspace column family.
Anyway I can do that?

Thanks






RE: Smart column searching for a particular rowKey

2015-02-03 Thread Mohammed Guller
Astyanax allows you to execute CQL statements. I don’t remember the details, 
but it is there.

One tip – when you create the column family, use WITH CLUSTERING ORDER BY 
(timestamp DESC). Then your query becomes straightforward and C* will do all the 
heavy lifting for you.
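For example (table and column names are just placeholders): CREATE TABLE ticks 
(stock text, ts timestamp, price decimal, shares int, PRIMARY KEY (stock, ts)) 
WITH CLUSTERING ORDER BY (ts DESC); then SELECT price, shares FROM ticks WHERE 
stock = 'GOOGLE' AND ts <= '2015-02-03 09:33:00' LIMIT 1 returns the most 
recent entry at or before the given time, which is the lookup described below.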

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Tuesday, February 3, 2015 11:54 AM
To: user@cassandra.apache.org
Subject: RE: Smart column searching for a particular rowKey

Cannot find something corresponding to where clause there.

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Tuesday, February 03, 2015 2:44 PM
To: user@cassandra.apache.org
Subject: RE: Smart column searching for a particular rowKey

Thanks, it does.
How about in astyanax?

From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Tuesday, February 03, 2015 1:49 PM
To: user@cassandra.apache.org
Subject: Re: Smart column searching for a particular rowKey

WHERE < + ORDER DESC + LIMIT should be able to accomplish that.

On Tue, Feb 3, 2015 at 11:28 AM, Ravi Agrawal 
mailto:ragra...@clearpoolgroup.com>> wrote:
Hi Guys,
Need help with this.
My rowKey is stockName like GOOGLE, APPLE.
Columns are sorted by timestamp and they include a set of data fields like 
price and size. So the data would be like:
1. 9:31:00, $520, 100 shares
2. 9:35:09, $530, 1000 shares
3. 9:45:39, $520, 500 shares
I want to search this column family by timestamp.
For a row key, if I search for data at 9:33:00, which does not actually exist in 
the columns, I want to return the last value where data was present; in this 
case 9:31:00, $520, 100 shares, since the next timestamp, 9:35:09, is greater 
than the input value entered.
One obvious way would be iterating through the columns and storing the last 
data; if the new timestamp is greater than the given timestamp, then return the 
last data stored.
Is there an optimized way to achieve the same, since the columns are already sorted?
Thanks





RE: Data tiered compaction and data model question

2015-02-18 Thread Mohammed Guller
What is the maximum number of events that you expect in a day? What is the 
worst-case scenario?

Mohammed

From: cass savy [mailto:casss...@gmail.com]
Sent: Wednesday, February 18, 2015 4:21 PM
To: user@cassandra.apache.org
Subject: Data tiered compaction and data model question

We want to track events in a log CF/table and should be able to query for events 
that occurred in a range of minutes or hours for a given day. Multiple events can 
occur in a given minute. I have listed 2 table designs and am leaning towards 
table 1 to avoid a large wide row. Please advise on the following.

Table 1: not a very wide row; still able to query for a range of minutes for a 
given day
and/or a given day and a range of hours
Create table log_Event
(
 event_day text,
 event_hr int,
 event_time timeuuid,
 data text,
PRIMARY KEY ( (event_day,event_hr),event_time)
)
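
(With table 1, a range-of-minutes query for a given day would look something 
like: SELECT data FROM log_Event WHERE event_day = '2015-02-18' AND event_hr = 9 
AND event_time >= minTimeuuid('2015-02-18 09:15:00') AND event_time < 
minTimeuuid('2015-02-18 09:30:00'); a range of hours can use IN on event_hr or 
one query per hour.)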
Table 2: This will be very wide row

Create table log_Event
( event_day text,
 event_time timeuuid,
 data text,
PRIMARY KEY ( event_day,event_time)
)

Date-tiered compaction: recommended for time series data as per the doc below. 
Our data will be kept only for 30 days, hence I thought of using this compaction 
strategy.
http://www.datastax.com/dev/blog/datetieredcompactionstrategy
I created table 1 listed above with this compaction strategy, added some rows, 
and did a manual flush. I do not see any sstables created yet. Is that expected?
 compaction={'max_sstable_age_days': '1', 'class': 
'DateTieredCompactionStrategy'}



RE: Data tiered compaction and data model question

2015-02-19 Thread Mohammed Guller
Reading 288,000 rows from a partition may cause problems. It is recommended not 
to read more than 100k rows in a partition (although paging may help). So 
Table 2 may cause issues.

I agree with Kai that you may not even need C* for this use case. C* is 
ideal for data with the 3 Vs: volume, velocity and variety. It doesn’t look like 
your data has the volume or velocity that a standard RDBMS cannot handle.

Mohammed

From: Kai Wang [mailto:dep...@gmail.com]
Sent: Thursday, February 19, 2015 6:06 AM
To: user@cassandra.apache.org
Subject: Re: Data tiered compaction and data model question

What's the typical size of the data field? Unless it's very large, I don't 
think table 2 is a "very" wide row (10x20x60x24=288000 events/partition at 
worst). Plus you only need to store 30 days of data. The overall data size is 
288,000 x 30 = 8,640,000 events. I am not even sure if you need C* depending on 
event size.

On Thu, Feb 19, 2015 at 12:00 AM, cass savy 
mailto:casss...@gmail.com>> wrote:
10-20 per minute is the average. Worst case can be 10x the average.

On Wed, Feb 18, 2015 at 4:49 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
What is the maximum number of events that you expect in a day? What is the 
worst-case scenario?

Mohammed

From: cass savy [mailto:casss...@gmail.com<mailto:casss...@gmail.com>]
Sent: Wednesday, February 18, 2015 4:21 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Data tiered compaction and data model question

We want to track events in log  Cf/table and should be able to query for events 
that occurred in range of mins or hours for given day. Multiple events can 
occur in a given minute.  Listed 2 table designs and leaning towards table 1 to 
avoid large wide row.  Please advice on

Table 1: not very widerow, still be able to query for range of minutes for 
given day
and/or given day and range of hours
Create table log_Event
(
 event_day text,
 event_hr int,
 event_time timeuuid,
 data text,
PRIMARY KEY ( (event_day,event_hr),event_time)
)
Table 2: This will be very wide row

Create table log_Event
( event_day text,
 event_time timeuuid,
 data text,
PRIMARY KEY ( event_day,event_time)
)

Datatiered compaction: recommended for time series data as per below doc. Our 
data will be kept only for 30 days. Hence thought of using this compaction 
strategy.
http://www.datastax.com/dev/blog/datetieredcompactionstrategy
Create table 1 listed above with this compaction strategy. Added some rows and 
did manual flush.  I do not see any sstables created yet. Is that expected?
 compaction={'max_sstable_age_days': '1', 'class': 
'DateTieredCompactionStrategy'}





Spark SQL Thrift JDBC/ODBC server + Cassandra

2015-04-07 Thread Mohammed Guller
Hi -

Is anybody using Cassandra with the Spark SQL Thrift JDBC/ODBC server? I can 
programmatically (within our app) use Spark SQL with C* using the 
Spark-Cassandra-Connector, but can't find any documentation on how to query C* 
through the Spark SQL Thrift JDBC/ODBC server. Would appreciate if you can 
point me to where I can find some documentation on this topic.

Thanks.

Mohammed




Spark SQL JDBC Server + DSE

2015-05-26 Thread Mohammed Guller
Hi -
As I understand, the Spark SQL Thrift/JDBC server cannot be used with the open 
source C*. Only DSE supports the Spark SQL JDBC server.

We would like to find out how many organizations are using this 
combination. If you do use DSE + Spark SQL JDBC server, it would be great if 
you could share your experience. For example, what kind of issues you have run 
into? How is the performance? What reporting tools you are using?

Thank  you!

Mohammed



RE: Spark SQL JDBC Server + DSE

2015-05-28 Thread Mohammed Guller
Anybody out there using DSE + Spark SQL JDBC server?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE

Hi -
As I understand, the Spark SQL Thrift/JDBC server cannot be used with the open 
source C*. Only DSE supports  the Spark SQL JDBC server.

We would like to find out whether how many organizations are using this 
combination. If you do use DSE + Spark SQL JDBC server, it would be great if 
you could share your experience. For example, what kind of issues you have run 
into? How is the performance? What reporting tools you are using?

Thank  you!

Mohammed



RE: Spark SQL JDBC Server + DSE

2015-05-29 Thread Mohammed Guller
Brian,
I implemented a similar REST server last year and it works great. Now we have a 
requirement to support JDBC connectivity in addition to the REST API. We want 
to allow users to use tools like Tableau to connect to C* through the Spark SQL 
JDBC/Thrift server.

Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Thursday, May 28, 2015 6:16 PM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE

Mohammed,

This doesn't really answer your question, but I'm working on a new REST server 
that allows people to submit SQL queries over REST, which get executed via 
Spark SQL.   Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html

I assume you need JDBC connectivity specifically?

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile * @boneill42<http://www.twitter.com/boneill42>

This information transmitted in this email message is for the intended 
recipient only and may contain confidential and/or privileged material. If you 
received this email in error and are not the intended recipient, or the person 
responsible to deliver it to the intended recipient, please contact the sender 
at the email above and delete this email and any attachments and destroy any 
copies thereof. Any review, retransmission, dissemination, copying or other use 
of, or taking any action in reliance upon, this information by persons or 
entities other than the intended recipient is strictly prohibited.


From: Mohammed Guller mailto:moham...@glassbeam.com>>
Reply-To: mailto:user@cassandra.apache.org>>
Date: Thursday, May 28, 2015 at 8:26 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Spark SQL JDBC Server + DSE

Hi -
As I understand, the Spark SQL Thrift/JDBC server cannot be used with the open 
source C*. Only DSE supports  the Spark SQL JDBC server.

We would like to find out whether how many organizations are using this 
combination. If you do use DSE + Spark SQL JDBC server, it would be great if 
you could share your experience. For example, what kind of issues you have run 
into? How is the performance? What reporting tools you are using?

Thank  you!

Mohammed



RE: Spark SQL JDBC Server + DSE

2015-06-01 Thread Mohammed Guller
Brian,
We haven't open sourced the REST server, but not  opposed to doing it. Just 
need to carve out some time to clean up the code and carve it out from all the 
other stuff that we do in that REST server.  Will try to do it in the next few 
weeks. If you need it sooner, let me know.

I did consider the option of writing our own Spark SQL JDBC driver for C*, but 
it is lower on the priority list right now.

Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Saturday, May 30, 2015 3:12 AM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE


Any chance you open-sourced, or could open-source the REST server? ;)

In thinking about it...
It doesn't feel like it would be that hard to write a Spark SQL JDBC driver 
against Cassandra, akin to what they have for hive:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

I wouldn't mind collaborating on that, if you are headed in that direction.
(and then I could write the REST server on top of that)

LMK,

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile * @boneill42<http://www.twitter.com/boneill42>

This information transmitted in this email message is for the intended 
recipient only and may contain confidential and/or privileged material. If you 
received this email in error and are not the intended recipient, or the person 
responsible to deliver it to the intended recipient, please contact the sender 
at the email above and delete this email and any attachments and destroy any 
copies thereof. Any review, retransmission, dissemination, copying or other use 
of, or taking any action in reliance upon, this information by persons or 
entities other than the intended recipient is strictly prohibited.


From: Mohammed Guller mailto:moham...@glassbeam.com>>
Reply-To: mailto:user@cassandra.apache.org>>
Date: Friday, May 29, 2015 at 2:15 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: RE: Spark SQL JDBC Server + DSE

Brian,
I implemented a similar REST server last year and it works great. Now we have a 
requirement to support JDBC connectivity in addition to the REST API. We want 
to allow users to use tools like Tableau to connect to C* through the Spark SQL 
JDBC/Thift server.

Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Thursday, May 28, 2015 6:16 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Spark SQL JDBC Server + DSE

Mohammed,

This doesn't really answer your question, but I'm working on a new REST server 
that allows people to submit SQL queries over REST, which get executed via 
Spark SQL.   Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html

I assume you need JDBC connectivity specifically?

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile * @boneill42<http://www.twitter.com/boneill42>

This information transmitted in this email message is for the intended 
recipient only and may contain confidential and/or privileged material. If you 
received this email in error and are not the intended recipient, or the person 
responsible to deliver it to the intended recipient, please contact the sender 
at the email above and delete this email and any attachments and destroy any 
copies thereof. Any review, retransmission, dissemination, copying or other use 
of, or taking any action in reliance upon, this information by persons or 
entities other than the intended recipient is strictly prohibited.


From: Mohammed Guller mailto:moham...@glassbeam.com>>
Reply-To: mailto:user@cassandra.apache.org>>
Date: Thursday, May 28, 2015 at 8:26 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Spark SQL JDBC Server + DSE

Hi -
As I understand, the Spark SQL Thrift/JDBC server cannot be used with the open 
source C*. Only DSE supports  the Spark SQL JDBC server.

We would like to find out whether how many organizations are using this 
combination. If you do use DSE + Spark SQL JDBC server, it would be great if 
you could share your experience. For example, what kind of issues you have run 
into? How is the performance? What reporting tools you are using?

Thank  you!

Mohammed



RE: Cassandra 2.2, 3.0, and beyond

2015-06-11 Thread Mohammed Guller
Considering that 2.1.6 was just released and it is the first “stable” release 
ready for production in the 2.1 series, won’t it be too soon to EOL 2.1.x when 
3.0 comes out in September?

Mohammed

From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Thursday, June 11, 2015 10:14 AM
To: user
Subject: Re: Cassandra 2.2, 3.0, and beyond

As soon as 8099 is done.

On Thu, Jun 11, 2015 at 11:53 AM, Pierre Devops 
mailto:pierredev...@gmail.com>> wrote:
Hi,

3.x beta release date ?

2015-06-11 16:21 GMT+02:00 Jonathan Ellis 
mailto:jbel...@gmail.com>>:
3.1 is EOL as soon as 3.3 (the next bug fix release) comes out.

On Thu, Jun 11, 2015 at 4:10 AM, Stefan Podkowinski 
mailto:stefan.podkowin...@1und1.de>> wrote:
> We are also extending our backwards compatibility policy to cover all 3.x 
> releases: you will be able to upgrade seamlessly from 3.1 to 3.7, for 
> instance, including cross-version repair.

What will be the EOL policy for releases after 3.0? Given your example, will 
3.1 still see bugfixes at this point when I decide to upgrade to 3.7?



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced




--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced


RE: Cassandra 2.2, 3.0, and beyond

2015-06-11 Thread Mohammed Guller
By that logic, 2.1.0  should have been somewhat as stable as 2.0.10 (the last 
release of 2.0.x branch before 2.1.0). However, we found out that it took 
almost 9 months for 2.1.x series to become stable and suitable for production. 
Going by past history, I am worried that it may take the same time for 2.2 to 
become stable.

Mohammed

From: graham sanderson [mailto:gra...@vast.com]
Sent: Thursday, June 11, 2015 6:34 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra 2.2, 3.0, and beyond

I think the point is that 2.2 will replace 2.1.x + (i.e. the done/safe bits of 
3.0 are included in 2.2).. so 2.2.x and 2.1.x are somewhat synonymous.

On Jun 11, 2015, at 8:14 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:

Considering that 2.1.6 was just released and it is the first “stable” release 
ready for production in the 2.1 series, won’t it be too soon to EOL 2.1.x when 
3.0 comes out in September?

Mohammed

From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Thursday, June 11, 2015 10:14 AM
To: user
Subject: Re: Cassandra 2.2, 3.0, and beyond

As soon as 8099 is done.

On Thu, Jun 11, 2015 at 11:53 AM, Pierre Devops 
mailto:pierredev...@gmail.com>> wrote:
Hi,

3.x beta release date ?

2015-06-11 16:21 GMT+02:00 Jonathan Ellis 
mailto:jbel...@gmail.com>>:
3.1 is EOL as soon as 3.3 (the next bug fix release) comes out.

On Thu, Jun 11, 2015 at 4:10 AM, Stefan Podkowinski 
mailto:stefan.podkowin...@1und1.de>> wrote:
> We are also extending our backwards compatibility policy to cover all 3.x 
> releases: you will be able to upgrade seamlessly from 3.1 to 3.7, for 
> instance, including cross-version repair.

What will be the EOL policy for releases after 3.0? Given your example, will 
3.1 still see bugfixes at this point when I decide to upgrade to 3.7?



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com<http://www.datastax.com/>
@spyced




--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com<http://www.datastax.com/>
@spyced



RE: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Mohammed Guller
The plugin looks cool. Thank you for open sourcing it.

Does it support faceting and other Solr functionality?

Mohammed

From: Andres de la Peña [mailto:adelap...@stratio.com]
Sent: Friday, June 12, 2015 3:43 AM
To: user@cassandra.apache.org
Subject: Re: Lucene index plugin for Apache Cassandra

I really appreciate your interest

Well, the first recommendation is to not use it unless you need it, because a 
properly denormalized Cassandra model is almost always preferable to indexing. 
Lucene indexing is a good option when there is no viable denormalization 
alternative. This is the case of range queries over multiple dimensions, 
full-text search or maybe complex boolean predicates. It's also appropriate for 
Spark/Hadoop jobs mapping a small fraction of the total amount of rows in a 
certain table, if you can pay the cost of indexing.

Lucene indexes run inside C*, so users should closely monitor the amount of 
used memory. It's also a good idea to put the Lucene directory files in a 
separate disk to those used by C* itself. Additionally, you should consider 
that indexed tables write throughput will be appreciably reduced, maybe to a 
few thousands rows per second.

It's really hard to estimate the amount of resources needed by the index due to 
the great variety of indexing and querying ways that Lucene offers, so the only 
thing we can suggest is to empirically find the optimal setup for your use case.

2015-06-12 12:00 GMT+02:00 Carlos Rolo 
mailto:r...@pythian.com>>:
Seems like an interesting tool!
What operational recommendations would you make to users of this tool (Extra 
hardware capacity, extra metrics to monitor, etc)?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: 
linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña 
mailto:adelap...@stratio.com>> wrote:
Unfortunately, we haven't published any benchmarks yet, but we have plans to 
do it as soon as possible. However, you can expect a similar behavior as those 
of Elasticsearch or Solr, with some overhead due to the need for indexing both 
the Cassandra's row key and the partition's token. You can also take a look at 
this 
presentation
 to see how cluster distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead 
mailto:b...@instaclustr.com>>:
Looks awesome, do you have any examples/benchmarks of using these indexes for 
various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

On 10 June 2015 at 09:08, Andres de la Peña 
mailto:adelap...@stratio.com>> wrote:
Hi all,

With the release of Cassandra 2.1.6, Stratio is glad to present its open source 
Lucene-based implementation of C* secondary 
indexes as a plugin that can 
be attached to Apache Cassandra. Before the above changes, Lucene index was 
distributed inside a fork of Apache Cassandra, with all the difficulties 
implied. As of now, the fork is discontinued and new users should use the 
recently created plugin, which maintains all the features of Stratio 
Cassandra.

Stratio's Lucene index extends Cassandra’s functionality to provide near 
real-time distributed search engine capabilities such as with ElasticSearch or 
Solr, including full text search capabilities, free multivariable search, 
relevance queries and field-based sorting. Each node indexes its own data, so 
high availability and scalability is guaranteed.

We hope this will be useful to the Apache Cassandra community.

Regards,

--

Andrés de la Peña

Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // 
@stratiobd



--

Ben Bromhead

Instaclustr | www.instaclustr.com | 
@instaclustr | (650) 284 9692



--

Andrés de la Peña

Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // 
@stratiobd



--





--

Andrés de la Peña

Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd


RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to 
submit a query from a browser and returns the result in JSON format.

Another alternative is to leave a Spark shell or one of the notebooks (Spark 
Notebook, Zeppelin, etc.) session open and run queries from there. This model 
works only if people give you the queries to execute.

Mohammed

From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Friday, June 19, 2015 2:21 AM
To: user@cassandra.apache.org
Subject: Code review - Spark SQL command-line client for Cassandra

Hi all,

I have been struggling with Cassandra’s lack of adhoc query support (I know 
this is an anti-pattern of Cassandra, but sometimes management come over and 
ask me to run stuff and it’s impossible to explain that it will take me a while 
when it would take about 10 seconds in MySQL) so I have put together the 
following code snippet that bundles DataStax’s Cassandra Spark connector and 
allows you to submit Spark SQL to it, outputting the results in a text file.

Does anyone spot any obvious flaws in this plan?? (I have a lot more error 
handling etc in my code, but removed it here for brevity)

private void run(String sqlQuery) {
SparkContext scc = new SparkContext(conf);
CassandraSQLContext csql = new CassandraSQLContext(scc);
DataFrame sql = csql.sql(sqlQuery);
String folderName = "/tmp/output_" + System.currentTimeMillis();
LOG.info("Attempting to save SQL results in folder: " + folderName);
sql.rdd().saveAsTextFile(folderName);
LOG.info("SQL results saved");
}

public static void main(String[] args) {

String sparkMasterUrl = args[0];
String sparkHost = args[1];
String sqlQuery = args[2];

SparkConf conf = new SparkConf();
conf.setAppName("Java Spark SQL");
conf.setMaster(sparkMasterUrl);
conf.set("spark.cassandra.connection.host", sparkHost);

JavaSparkSQL app = new JavaSparkSQL(conf);

app.run(sqlQuery);
}

I can then submit this to Spark with ‘spark-submit’:


>  ./spark-submit --class com.algomi.spark.JavaSparkSQL --master 
> spark://sales3:7077 
> spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
> spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"

It seems to work pretty well, so I’m pretty happy, but wondering why this isn’t 
common practice (at least I haven’t been able to find much about it on Google) 
– is there something terrible that I’m missing?

Thanks!
Matthew




RE: Cassandra Summit 2015 Roll Call!

2015-09-22 Thread Mohammed Guller
Hey everyone,
I will be at the summit too on Wed and Thu.  I am giving a talk on Thursday at 
2.40pm.

Would love to meet everyone on this list in person.  Here is an old picture of 
mine:
https://events.mfactormeetings.com/accounts/register123/mfactor/datastax/events/dstaxsummit2015/guller.jpg

Mohammed

From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Tuesday, September 22, 2015 5:23 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra Summit 2015 Roll Call!

Hi guys.

I'm already here and I'll be the whole Summit. I'll be doing a live demo on 
Thursday on troubleshooting Cassandra production issues as a developer.

This is me!! https://twitter.com/calonso/status/646352711454097408

Carlos Alonso | Software Engineer | @calonso

On 22 September 2015 at 15:27, Jeff Jirsa 
mailto:jeff.ji...@crowdstrike.com>> wrote:
I’m here. Will be speaking Wednesday on DTCS for time series workloads: 
http://cassandrasummit-datastax.com/agenda/real-world-dtcs-for-operators/

Picture if you recognize me, say hi: 
https://events.mfactormeetings.com/accounts/register123/mfactor/datastax/events/dstaxsummit2015/jirsa.jpg
 (probably wearing glasses and carrying a black Crowdstrike backpack)

- Jeff


From: Robert Coli
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 11:27 AM
To: "user@cassandra.apache.org"
Subject: Cassandra Summit 2015 Roll Call!

Cassandra Summit 2015 is upon us!

Every year, the conference gets bigger and bigger, and the chance of IRL 
meeting people you've "met" online gets smaller and smaller.

To improve everyone's chances, if you are attending the summit :

1) respond on-thread with a brief introduction (and physical description of 
yourself if you want others to be able to spot you!)
2) join #cassandra on freenode IRC (irc.freenode.org) 
to chat and connect with other attendees!

MY CONTRIBUTION :
--
I will be at the summit on Wednesday and Thursday. I am 5'8" or so, and will be 
wearing glasses and either a red or blue "Eventbrite Engineering" t-shirt with 
a graphic logo of gears on it. Come say hello! :D

=Rob




RE: reducing disk space consumption

2016-02-10 Thread Mohammed Guller
If I remember it correctly, C* creates a snapshot when you drop a keyspace. Run 
the following command to get rid of the snapshot:
nodetool clearsnapshot
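
If I remember the syntax correctly, you can also limit it to that keyspace, 
e.g. nodetool clearsnapshot usertable, and, if your version has it, nodetool 
listsnapshots will show which snapshots exist and how much space they take.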

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 10, 2016 6:59 AM
To: user@cassandra.apache.org
Subject: reducing disk space consumption

Hi,
I am using DSE 4.8.4
On one node, disk space is low where:

42G /var/lib/cassandra/data/usertable/data-0abea7f0cf9211e5a355bf8dafbfa99c

Using CLI, I dropped keyspace usertable but the data dir above still consumes 
42G.

What action would free this part of disk (I don't need the data) ?

Thanks