Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
We have an 8-node Cassandra cluster. Replication Strategy: 3. Consistency Level: Quorum. Data Spread: I can let you know once I get access to our production cluster. The use case for a simple count is more for internal use than for, say, end clients/customers; however, there are many use cases from customers

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Jörn Franke
I am not sure what use case you want to demonstrate with select count in general. Maybe you can elaborate on what your use case is. Aside from this: this is a Cassandra issue. What is the setup of Cassandra? Dedicated nodes? How many? Replication strategy? Consistency configuration? How is

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Some accurate numbers here: it took me 1 hr 30 min to count 698705723 rows (~700 million), and my code is just this: sc.cassandraTable("cuneiform", "blocks").cassandraCount On Thu, Nov 24, 2016 at 10:48 AM, kant kodali wrote: > Take a look at this
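To put the reported timing in perspective, the throughput can be worked out from the two figures in the message. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the per-node figure assumes the 8-node cluster mentioned elsewhere in the thread and an even spread of work across nodes, which the thread does not confirm.

```scala
// Figures reported in the message: 698705723 rows counted in 1 hr 30 min.
val rows = 698705723L
val elapsedSeconds = (1 * 60 + 30) * 60L // 1 h 30 min expressed in seconds

// Assumption: the 8-node cluster described elsewhere in the thread,
// with the scan spread evenly across all nodes.
val nodes = 8

val rowsPerSecond = rows.toDouble / elapsedSeconds
val rowsPerSecondPerNode = rowsPerSecond / nodes

println(f"$rowsPerSecond%.0f rows/s overall")
println(f"$rowsPerSecondPerNode%.0f rows/s per node")
```

Pasted into spark-shell or any Scala REPL, this gives a rough baseline against which a change in split size or executor settings can be compared.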

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Take a look at this: https://github.com/brianmhess/cassandra-count Now it is just a matter of incorporating it into spark-cassandra-connector, I guess. On Thu, Nov 24, 2016 at 1:01 AM, kant kodali wrote: > According to this link https://github.com/datastax/ >

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
According to this link, https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md, I tried the following, but it still looks like it is taking forever: sc.cassandraTable(keyspace, table).cassandraCount On Thu, Nov 24, 2016 at 12:56 AM, kant kodali

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I would be glad if SELECT COUNT(*) FROM hello could return any value for that size :) I can say for sure it didn't return anything for 30 mins, and I probably need to build more patience to sit for a few more hours after that! Cassandra recommends using ColumnFamilyStats via nodetool cfstats which

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Anastasios Zouzias
How fast is Cassandra without Spark on the count operation? cqlsh> SELECT COUNT(*) FROM hello (this is not equivalent to what you are doing, but it might help you find the root cause) On Thu, Nov 24, 2016 at 9:03 AM, kant kodali wrote: > I have the following code > > I

Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I have the following code. I invoke spark-shell as follows: ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864 Code: scala> val df = spark.sql("SELECT test from hello") //
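One detail in the invocation above is worth checking with arithmetic. The property name spark.cassandra.input.split.size_in_mb indicates a value in megabytes, yet 67108864 equals 64 * 1024 * 1024, which is what 64 MB looks like when written in bytes. The sketch below only compares the two readings; whether the poster intended 64 MB is an assumption, not something the thread states, and the variable names are illustrative.

```scala
// Value passed on the command line for spark.cassandra.input.split.size_in_mb.
val configured = 67108864L

val bytesPerMiB = 1024L * 1024L
val mibPerTiB = 1024L * 1024L

// Reading 1 (assumed intent): the number was meant as a byte count.
val mibIfValueWereBytes = configured / bytesPerMiB

// Reading 2 (what the property name suggests): the number is in megabytes.
val tibIfValueIsMegabytes = configured / mibPerTiB

println(s"As bytes: $mibIfValueWereBytes MiB per split")
println(s"As megabytes: $tibIfValueIsMegabytes TiB per split")
```

If the second reading applies, each split would be far larger than the whole table is likely to be, which would tend to leave the scan with very few partitions and little parallelism. That is a possible contributor to the slow count, not a confirmed diagnosis.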