I would be glad if SELECT COUNT(*) FROM hello could return any value at that
size :) I can say for sure it didn't return anything in 30 minutes, and I would
probably need to build up more patience to sit through a few more hours after
that! Cassandra recommends using nodetool cfstats (ColumnFamilyStats), which
gives a pretty good estimate but not an accurate value.
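
For reference, a minimal sketch of that approach, assuming a keyspace named
mykeyspace (the keyspace name is just a placeholder; substitute your own):

    # prints table statistics, including a "Number of keys (estimate)" line,
    # without scanning the whole table
    nodetool cfstats mykeyspace.hello

    # on newer Cassandra versions the same data is available via
    nodetool tablestats mykeyspace.hello

Keep in mind the estimate counts partitions, so it only approximates the row
count when each partition holds a single row.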

On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
wrote:

> How fast is Cassandra without Spark on the count operation?
>
> cqlsh> SELECT COUNT(*) FROM hello;
>
> (this is not equivalent to what you are doing, but it might help you find
> the root cause)
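>
> If the count times out on the client side, you may also need to raise the
> client request timeout when starting cqlsh (assuming a reasonably recent
> cqlsh; 3600 seconds here is just an illustrative value):
>
> cqlsh --request-timeout=3600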
>
> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> I have the following code
>>
>> I invoke spark-shell as follows
>>
>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>> --executor-memory 15G --executor-cores 12 --conf
>> spark.cassandra.input.split.size_in_mb=67108864
>>
>> code
>>
>>     scala> val df = spark.sql("SELECT test from hello") // about a billion
>> rows in hello, and the test column is 1KB
>>
>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>
>>     scala> df.count
>>
>>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean
>> precisely.
>>
>> If I invoke spark-shell as follows
>>
>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>
>> code
>>
>>
>>     val df = spark.sql("SELECT test from hello") // This has about a
>> billion rows
>>
>>     scala> df.count
>>
>>
>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?
>>
>>
>> Neither of these versions worked: Spark keeps running forever, and I have
>> been waiting for more than 15 minutes with no response. Any ideas on what
>> could be wrong and how to fix it?
>>
>> I am using Spark 2.0.2
>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>
>>
>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
>
