Apache Spark or the Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

Here is my code, run from spark-shell:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType

spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever
```

I have about 700 million rows, each about 1 KB, and the last line (`val df4 = spark.read.json(rdd)`) takes forever. After 1 hr 30 min I see the following progress:

    Stage 1:==========> (4866 + 2) / 25256]

At this rate it will probably take days. I measured the network throughput of the Spark worker nodes using iftop and it is about 2.2 KB/s (kilobytes per second), which is far too low. That tells me it is not reading the partitions in parallel, or at the very least it is not reading a good chunk of data at a time; otherwise the rate would be in MB/s. Any ideas on how to fix it?
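One thing worth noting: `spark.read.json(rdd)` with no schema forces Spark to make a full pass over the data just to infer the JSON schema, and then a second pass to actually parse it. A sketch of two ways to skip that inference pass, assuming you know the JSON layout up front (the field names `id` and `value` below are hypothetical placeholders for whatever your documents actually contain):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json

// Hypothetical schema -- replace the fields with those of your JSON documents.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("value", DoubleType)
))

// Option 1: parse the JSON string column in place with from_json,
// avoiding the DataFrame -> RDD -> DataFrame round-trip entirely.
val parsed = df2
  .select(from_json($"test", schema).as("doc"))
  .select("doc.*")

// Option 2: keep the RDD, but pass the schema so spark.read.json
// does a single parsing pass instead of inference + parsing.
val df4 = spark.read.schema(schema).json(rdd)
```

This is only a sketch, not a full diagnosis of the throughput problem; if the Cassandra read itself is the bottleneck, connector settings such as the input split size may also matter.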