Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD)

You can pass an RDD to spark.read.json. (spark here is the SparkSession.)
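
For example, a minimal sketch of that overload in spark-shell, where spark is
the predefined SparkSession (the sample JSON strings are made up):

// Each element of the RDD must be one complete JSON document.
val jsonStrings = spark.sparkContext.parallelize(Seq(
  """{"test": "a"}""",
  """{"test": "b"}"""
))
// json(RDD[String]) makes an extra pass over the data to infer the schema
// before building the DataFrame.
val df = spark.read.json(jsonStrings)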

Also, it works completely fine with a smaller dataset in a table, but with 1B
records it takes forever, and more importantly the network throughput is only
2.2 KB/s, which is far too low; it should be somewhere in the MB/s range.

On Sat, Nov 26, 2016 at 10:09 AM, Anastasios Zouzias wrote:

> Hi there,
>
> spark.read.json usually takes a filesystem path (usually on HDFS) pointing
> to a file containing one JSON document per line. See also
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Hence, in your case
>
> val df4 = spark.read.json(rdd) // This line takes forever
>
> seems wrong. I guess you might want to first store the RDD as a text file
> on HDFS and then read it using spark.read.json.
>
> Cheers,
> Anastasios
>
>
>
> On Sat, Nov 26, 2016 at 9:34 AM, kant kodali  wrote:
>
>>
>> Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading
>> multiple partitions in parallel.
>>
>> Here is my code, using spark-shell:
>>
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.types.StringType
>> spark.sql("""CREATE TEMPORARY VIEW hello USING 
>> org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", 
>> cluster "Test Cluster", pushdown "true")""")
>> val df = spark.sql("SELECT test from hello")
>> val df2 = df.select(df("test").cast(StringType).as("test"))
>> val rdd = df2.rdd.map { case Row(j: String) => j }
>> val df4 = spark.read.json(rdd) // This line takes forever
>>
>> I have about 700 million rows, each about 1 KB, and this line
>>
>> val df4 = spark.read.json(rdd)
>>
>> takes forever; I get the following output after 1 hr 30 min:
>>
>> [Stage 1:==> (4866 + 2) / 25256]
>>
>> So at this rate, it will probably take days.
>>
>> I measured the network throughput of the Spark worker nodes using iftop,
>> and it is about 2.2 KB/s (kilobytes per second), which is far too low. That
>> tells me it is not reading partitions in parallel, or at the very least it
>> is not reading a good chunk of data at a time; otherwise it would be in
>> MB/s. Any ideas on how to fix it?
>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>


Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread Anastasios Zouzias
Hi there,

spark.read.json usually takes a filesystem path (usually on HDFS) pointing to
a file containing one JSON document per line. See also

http://spark.apache.org/docs/latest/sql-programming-guide.html

Hence, in your case

val df4 = spark.read.json(rdd) // This line takes forever

seems wrong. I guess you might want to first store the RDD as a text file on
HDFS and then read it using spark.read.json.
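
A minimal sketch of that approach (the HDFS path below is hypothetical):

// Write one JSON document per line to HDFS, then point the file-based
// JSON reader at the resulting directory.
rdd.saveAsTextFile("hdfs://namenode:8020/tmp/hello-json")
val df4 = spark.read.json("hdfs://namenode:8020/tmp/hello-json")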

Cheers,
Anastasios



On Sat, Nov 26, 2016 at 9:34 AM, kant kodali  wrote:

>
> Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading
> multiple partitions in parallel.
>
> Here is my code, using spark-shell:
>
> import org.apache.spark.sql._
> import org.apache.spark.sql.types.StringType
> spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra 
> OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown 
> "true")""")
> val df = spark.sql("SELECT test from hello")
> val df2 = df.select(df("test").cast(StringType).as("test"))
> val rdd = df2.rdd.map { case Row(j: String) => j }
> val df4 = spark.read.json(rdd) // This line takes forever
>
> I have about 700 million rows, each about 1 KB, and this line
>
> val df4 = spark.read.json(rdd)
>
> takes forever; I get the following output after 1 hr 30 min:
>
> [Stage 1:==> (4866 + 2) / 25256]
>
> So at this rate, it will probably take days.
>
> I measured the network throughput of the Spark worker nodes using iftop,
> and it is about 2.2 KB/s (kilobytes per second), which is far too low. That
> tells me it is not reading partitions in parallel, or at the very least it
> is not reading a good chunk of data at a time; otherwise it would be in
> MB/s. Any ideas on how to fix it?
>
>


-- 
-- Anastasios Zouzias



Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading
multiple partitions in parallel.

Here is my code, using spark-shell:

import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType

// Expose the Cassandra table db.hello through the Spark-Cassandra-Connector.
spark.sql("""CREATE TEMPORARY VIEW hello USING
org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db",
cluster "Test Cluster", pushdown "true")""")
// Pull out the "test" column as an RDD of raw JSON strings.
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
// Parse the JSON strings into a DataFrame.
val df4 = spark.read.json(rdd) // This line takes forever

I have about 700 million rows, each about 1 KB, and this line

val df4 = spark.read.json(rdd)

takes forever; I get the following output after 1 hr 30 min:

[Stage 1:==> (4866 + 2) / 25256]

So at this rate, it will probably take days.

I measured the network throughput of the Spark worker nodes using iftop, and
it is about 2.2 KB/s (kilobytes per second), which is far too low. That tells
me it is not reading partitions in parallel, or at the very least it is not
reading a good chunk of data at a time; otherwise it would be in MB/s. Any
ideas on how to fix it?
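
A quick way to sanity-check the parallelism, using the df defined in the
spark-shell session above (a minimal sketch; note that Spark's progress bar
output "(4866 + 2) / 25256" reads as 4866 tasks done and only 2 running out
of 25256 total):

// Number of Spark partitions produced by the Cassandra scan.
println(df.rdd.getNumPartitions)
// Default number of tasks Spark will run concurrently across the cluster.
println(spark.sparkContext.defaultParallelism)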