https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD)

You can pass an RDD to spark.read.json (spark here is the SparkSession).
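
For reference, a minimal sketch of that overload (the sample data here is
made up):

import org.apache.spark.rdd.RDD
val jsonStrings: RDD[String] = spark.sparkContext.parallelize(Seq(
  """{"test": "a"}""", """{"test": "b"}"""))
val df = spark.read.json(jsonStrings) // one DataFrame row per JSON string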

Also, it works completely fine with a smaller dataset in a table, but with 1B
records it takes forever, and more importantly the network throughput is only
2.2 KB/s, which is far too low; it should be somewhere in the MB/s range.
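
One thing that might help (a sketch, assuming the JSON shape is known up
front): spark.read.json(rdd) has to scan the whole RDD once just to infer the
schema before it parses anything. Passing an explicit schema skips that pass:

import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("test", StringType))) // hypothetical schema
val df4 = spark.read.schema(schema).json(rdd) // no schema-inference scan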

On Sat, Nov 26, 2016 at 10:09 AM, Anastasios Zouzias <zouz...@gmail.com>
wrote:

> Hi there,
>
> spark.read.json usually takes a filesystem path (usually HDFS) pointing to a
> file that contains one JSON object per line. See also
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
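>
> For example (contents made up), such a file looks like:
>
> {"test": "a"}
> {"test": "b"}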
>
> Hence, in your case
>
> val df4 = spark.read.json(rdd) // This line takes forever
>
> seems wrong. I guess you might want to first store the RDD as a text file on
> HDFS and then read it back using spark.read.json.
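>
> Roughly, as a sketch (the HDFS path is just an example):
>
> rdd.saveAsTextFile("hdfs:///tmp/hello-json")
> val df4 = spark.read.json("hdfs:///tmp/hello-json")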
>
> Cheers,
> Anastasios
>
>
>
> On Sat, Nov 26, 2016 at 9:34 AM, kant kodali <kanth...@gmail.com> wrote:
>
>>
>> Apache Spark (or the Spark-Cassandra-Connector) doesn't look like it is
>> reading multiple partitions in parallel.
>>
>> Here is my code, run in spark-shell:
>>
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.types.StringType
>>
>> // Expose the Cassandra table as a temporary view
>> spark.sql("""CREATE TEMPORARY VIEW hello
>>   USING org.apache.spark.sql.cassandra
>>   OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
>>
>> // Pull the JSON column out as an RDD[String]
>> val df = spark.sql("SELECT test FROM hello")
>> val df2 = df.select(df("test").cast(StringType).as("test"))
>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>
>> // Parse each JSON string into a row of a new DataFrame
>> val df4 = spark.read.json(rdd) // This line takes forever
>>
>> I have about 700 million rows, each about 1KB, and the line
>>
>> val df4 = spark.read.json(rdd)
>>
>> takes forever; I get the following output after 1hr 30 mins:
>>
>> [Stage 1:==========> (4866 + 2) / 25256]
>>
>> so at this rate it will probably take days.
>>
>> I measured the network throughput of the Spark worker nodes using iftop,
>> and it is about 2.2 KB/s (kilobytes per second), which is far too low. That
>> tells me it is not reading partitions in parallel, or at the very least that
>> it is not reading a good chunk of data at a time; otherwise it would be in
>> MB/s. Any ideas on how to fix it?
>>
>>
>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
>
