Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD)

You can pass an RDD to spark.read.json. // spark here is the SparkSession

Also, it works completely fine with a smaller dataset in a table, but with 1B
records it takes forever, and more importantly the network throughput is
only 2.2 KB/s, which is far too low; it should be somewhere in the MB/s range.
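
For reference, a minimal sketch of that call (just a guess at usage, not a
tested fix). Note that without an explicit schema, spark.read.json makes an
extra full pass over the data just to infer one, which could be part of the
slowness; the field name below is hypothetical:

import org.apache.spark.sql.types._

// assumed: rdd is an RDD[String] with one JSON document per element
val schema = StructType(Seq(StructField("id", StringType))) // hypothetical field
val df4 = spark.read.schema(schema).json(rdd) // skips the schema-inference pass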

On Sat, Nov 26, 2016 at 10:09 AM, Anastasios Zouzias wrote:

> Hi there,
>
> spark.read.json usually takes a filesystem path (usually HDFS) where there
> is a file containing one JSON document per line. See also
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Hence, in your case
>
> val df4 = spark.read.json(rdd) // This line takes forever
>
> seems wrong. I guess you might want to first store the RDD as a text file on
> HDFS and then read it using spark.read.json.
>
> Cheers,
> Anastasios
>
>
>
On Sat, Nov 26, 2016 at 9:34 AM, kant kodali wrote:
>
>>
>> Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading
>> multiple partitions in parallel.
>>
>> Here is my code, run in spark-shell:
>>
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.types.StringType
>> // Expose the Cassandra table as a view via the Spark-Cassandra-Connector
>> spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra
>> OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
>> val df = spark.sql("SELECT test from hello")
>> val df2 = df.select(df("test").cast(StringType).as("test")) // cast to String
>> val rdd = df2.rdd.map { case Row(j: String) => j } // RDD[String] of JSON docs
>> val df4 = spark.read.json(rdd) // This line takes forever
>>
>> I have about 700 million rows, each about 1 KB, and this line
>>
>> val df4 = spark.read.json(rdd) takes forever; this is the
>> output after 1 hr 30 min:
>>
>> [Stage 1:==> (4866 + 2) / 25256]
>>
>> So at this rate it will probably take days.
>>
>> I measured the network throughput of the Spark worker nodes using iftop,
>> and it is about 2.2 KB/s (kilobytes per second), which is far too low. That
>> tells me it is not reading partitions in parallel, or at the very least it
>> is not reading large chunks of data; otherwise it would be in the MB/s
>> range. Any ideas on how to fix this?
>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>


Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread Anastasios Zouzias
Hi there,

spark.read.json usually takes a filesystem path (usually HDFS) where there
is a file containing one JSON document per line. See also

http://spark.apache.org/docs/latest/sql-programming-guide.html

Hence, in your case

val df4 = spark.read.json(rdd) // This line takes forever

seems wrong. I guess you might want to first store the RDD as a text file on
HDFS and then read it using spark.read.json.
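
For concreteness, a rough sketch of that approach (the HDFS path below is
just a placeholder):

rdd.saveAsTextFile("hdfs:///tmp/hello_json") // writes one JSON document per line
val df4 = spark.read.json("hdfs:///tmp/hello_json") // plain path-based read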

Cheers,
Anastasios



On Sat, Nov 26, 2016 at 9:34 AM, kant kodali wrote:

>
> Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading
> multiple partitions in parallel.
>
> Here is my code, run in spark-shell:
>
> import org.apache.spark.sql._
> import org.apache.spark.sql.types.StringType
> // Expose the Cassandra table as a view via the Spark-Cassandra-Connector
> spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra
> OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
> val df = spark.sql("SELECT test from hello")
> val df2 = df.select(df("test").cast(StringType).as("test")) // cast to String
> val rdd = df2.rdd.map { case Row(j: String) => j } // RDD[String] of JSON docs
> val df4 = spark.read.json(rdd) // This line takes forever
>
> I have about 700 million rows, each about 1 KB, and this line
>
> val df4 = spark.read.json(rdd) takes forever; this is the
> output after 1 hr 30 min:
>
> [Stage 1:==> (4866 + 2) / 25256]
>
> So at this rate it will probably take days.
>
> I measured the network throughput of the Spark worker nodes using iftop,
> and it is about 2.2 KB/s (kilobytes per second), which is far too low. That
> tells me it is not reading partitions in parallel, or at the very least it
> is not reading large chunks of data; otherwise it would be in the MB/s
> range. Any ideas on how to fix this?
>
>


-- 
-- Anastasios Zouzias