<http://stackoverflow.com/questions/40797231/apache-spark-or-spark-cassandra-connector-doesnt-look-like-it-is-reading-multipl?noredirect=1#>

Apache Spark (or the Spark-Cassandra-Connector) doesn't look like it is
reading multiple partitions in parallel.

Here is my code, run in spark-shell:

import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType

// Expose the Cassandra table db.hello as a temporary view,
// with predicate pushdown enabled
spark.sql("""CREATE TEMPORARY VIEW hello USING
org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db",
cluster "Test Cluster", pushdown "true")""")

// Pull out the JSON column as a plain-string RDD
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }

val df4 = spark.read.json(rdd) // This line takes forever
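
A likely reason that last step is so slow: without an explicit schema,
spark.read.json makes an extra full pass over the RDD just to infer the
schema before it parses anything. A minimal sketch of supplying the schema
up front, assuming the JSON documents share a known structure (the field
names below are hypothetical placeholders):

import org.apache.spark.sql.types._

// Hypothetical schema -- replace these fields with the actual structure
// of the JSON stored in the "test" column.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)
))

// With a schema supplied, Spark skips the inference pass over all rows.
val df4 = spark.read.schema(schema).json(rdd)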

I have about 700 million rows, each about 1 KB, and this line

val df4 = spark.read.json(rdd)

takes forever. After 1 hr 30 min I get the following output:

[Stage 1:==========> (4866 + 2) / 25256]

So at this rate it will probably take days.
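
For context on reading that line: Spark's console progress bar shows
(completed + running) / total tasks, so "(4866 + 2)" means only two tasks
are executing at once even though 25256 exist. A quick sanity check from
the shell (a sketch, using only what is already defined above):

// How many read tasks exist vs. a rough proxy for the cores Spark can use
println(df2.rdd.getNumPartitions)
println(sc.defaultParallelism)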

I measured the network throughput of the Spark worker nodes using iftop,
and it is about 2.2 KB/s (kilobytes per second), which is far too low.
That tells me it is not reading partitions in parallel, or at the very
least it is not reading data in reasonably large chunks; otherwise the
rate would be in MB/s. Any ideas on how to fix it?
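
In case the bottleneck is on the connector side instead, the read split
size is the main knob I know of for partition granularity; the property
below comes from the Spark-Cassandra-Connector reference docs (default
64 MB per split in connector 2.x), but its name has varied across
versions, so verify it against the version in use:

spark-shell --conf spark.cassandra.input.split.size_in_mb=32

A smaller split size makes the connector create more, smaller Spark
partitions per token range, which means more read tasks that can run
concurrently.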
