Apache Spark SQL is taking forever to count billion rows from Cassandra?

kant kodali Thu, 24 Nov 2016 00:04:29 -0800

I have the following code

I invoke spark-shell as follows


    ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
--executor-memory 15G --executor-cores 12 --conf
spark.cassandra.input.split.size_in_mb=67108864

code

    scala> val df = spark.sql("SELECT test from hello") // Billion rows in
hello and test column is 1KB

    df: org.apache.spark.sql.DataFrame = [test: binary]

    scala> df.count

    [Stage 0:>   (0 + 2) / 13] // I dont know what these numbers mean
precisely.

If I invoke spark-shell as follows

    ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134

code


    val df = spark.sql("SELECT test from hello") // This has about billion
rows

    scala> df.count


    [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?


Both of these versions didn't work Spark keeps running forever and I have
been waiting for more than 15 mins and no response. Any ideas on what could
be wrong and how to fix this?

I am using Spark 2.0.2
and spark-cassandra-connector_2.11-2.0.0-M3.jar

Apache Spark SQL is taking forever to count billion rows from Cassandra?

Reply via email to