Take a look at this https://github.com/brianmhess/cassandra-count
Now it is just a matter of incorporating it into spark-cassandra-connector, I guess.

On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:

> According to this link
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
> I tried the following, but it still looks like it is taking forever:
>
>     sc.cassandraTable(keyspace, table).cassandraCount
>
> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> I would be glad if SELECT COUNT(*) FROM hello could return any value for
>> that size :) I can say for sure it didn't return anything for 30 minutes,
>> and I probably need to build more patience to sit for a few more hours
>> after that! Cassandra recommends using nodetool cfstats to get the
>> ColumnFamilyStats, which give a pretty good estimate but not an accurate
>> value.
>>
>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
>> wrote:
>>
>>> How fast is Cassandra without Spark on the count operation?
>>>
>>>     cqlsh> SELECT COUNT(*) FROM hello;
>>>
>>> (This is not equivalent to what you are doing, but it might help you
>>> find the root cause.)
>>>
>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> I have the following code.
>>>>
>>>> I invoke spark-shell as follows:
>>>>
>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
>>>>       --executor-memory 15G --executor-cores 12 \
>>>>       --conf spark.cassandra.input.split.size_in_mb=67108864
>>>>
>>>> Code:
>>>>
>>>>     scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello; the test column is 1KB
>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>
>>>>     scala> df.count
>>>>     [Stage 0:> (0 + 2) / 13] // I don't know what these numbers mean precisely.
>>>>
>>>> If I invoke spark-shell as follows:
>>>>
>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>
>>>> Code:
>>>>
>>>>     scala> val df = spark.sql("SELECT test from hello") // This has about a billion rows
>>>>
>>>>     scala> df.count
>>>>     [Stage 0:=> (686 + 2) / 24686] // What are these numbers precisely?
>>>>
>>>> Neither of these versions worked: Spark keeps running forever, and I
>>>> have been waiting for more than 15 minutes with no response. Any ideas
>>>> on what could be wrong and how to fix it?
>>>>
>>>> I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.
>>>
>>> --
>>> -- Anastasios Zouzias
>>> <a...@zurich.ibm.com>
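[Editor's note on the first spark-shell invocation above: `spark.cassandra.input.split.size_in_mb` is measured in megabytes, and 67108864 is what 64 MB looks like when written in bytes. A split size of ~67 million MB makes each input split enormous, so the connector produces very few partitions (the `/ 13` in the first run) instead of many small ones (the `/ 24686` with the default of 64 MB). The sketch below is not from the thread; it only illustrates the arithmetic, using a hypothetical helper and an assumed table size of roughly a billion 1 KB rows.]

```scala
// Hypothetical helper illustrating how split size drives the Spark partition
// count: roughly ceil(tableSizeMb / splitSizeMb), floored at 1. The real
// connector splits by token range estimates, so actual counts differ.
object SplitSizeEstimate {
  def partitions(tableSizeMb: Long, splitSizeMb: Long): Long =
    math.max(1L, math.ceil(tableSizeMb.toDouble / splitSizeMb).toLong)
}

// Assumed table size: ~1 billion rows x 1 KB each ~= 976,562 MB (about 1 TB).
val tableSizeMb = 1000000000L / 1024

// Default split size of 64 MB -> thousands of reasonably sized tasks.
println(SplitSizeEstimate.partitions(tableSizeMb, 64L))        // -> 15259

// 67108864 ("64 MB" mistakenly given in bytes) -> a single giant split.
println(SplitSizeEstimate.partitions(tableSizeMb, 67108864L))  // -> 1
```

In other words, the first invocation almost certainly intended `--conf spark.cassandra.input.split.size_in_mb=64` (or simply the default), and the huge value explains why so few tasks were created.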