Hi there,

spark.read.json usually takes a filesystem path (usually on HDFS) pointing to a file with one JSON object per line. See also
http://spark.apache.org/docs/latest/sql-programming-guide.html

Hence, in your case the line

val df4 = spark.read.json(rdd) // This line takes forever

seems wrong. I guess you might want to first store the RDD as a text file on HDFS and then read it using spark.read.json.

Cheers,
Anastasios

On Sat, Nov 26, 2016 at 9:34 AM, kant kodali <kanth...@gmail.com> wrote:
> http://stackoverflow.com/questions/40797231/apache-spark-or-spark-cassandra-connector-doesnt-look-like-it-is-reading-multipl?noredirect=1#
>
> Apache Spark (or the Spark-Cassandra-Connector) doesn't look like it is
> reading multiple partitions in parallel.
>
> Here is my code using spark-shell:
>
> import org.apache.spark.sql._
> import org.apache.spark.sql.types.StringType
>
> spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra
>   OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
>
> val df = spark.sql("SELECT test from hello")
> val df2 = df.select(df("test").cast(StringType).as("test"))
> val rdd = df2.rdd.map { case Row(j: String) => j }
> val df4 = spark.read.json(rdd) // This line takes forever
>
> I have about 700 million rows, each row about 1 KB, and the line
>
> val df4 = spark.read.json(rdd)
>
> takes forever: after 1 hr 30 mins I get the following output
>
> [Stage 1:==========> (4866 + 2) / 25256]
>
> so at this rate it will probably take days.
>
> I measured the network throughput of the Spark worker nodes using iftop
> and it is about 2.2 KB/s (kilobytes per second), which is far too low.
> That tells me it is not reading partitions in parallel, or at the very
> least it is not reading a good chunk of data; otherwise the rate would
> be in MB/s. Any ideas on how to fix it?

--
-- Anastasios Zouzias <a...@zurich.ibm.com>
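
P.S. A minimal spark-shell sketch of the suggested workaround (write the
JSON strings out to HDFS first, then let spark.read.json scan the files in
parallel). The HDFS path here is a made-up example, and this has not been
tested at your data volume:

```scala
import org.apache.spark.sql.Row

// Continuing from the code in the question: rdd is an RDD[String] of
// JSON documents pulled from the Cassandra-backed view.
val rdd = df2.rdd.map { case Row(j: String) => j }

// Persist the JSON lines to HDFS first ("/tmp/hello_json" is a
// hypothetical path; pick one that suits your cluster).
rdd.saveAsTextFile("hdfs:///tmp/hello_json")

// Reading from a path lets Spark schedule one task per file split,
// rather than re-evaluating the Cassandra-backed RDD lineage.
val df4 = spark.read.json("hdfs:///tmp/hello_json")
```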