Hi,

I use the following code to read in data and extract the unique users using
Spark SQL. The data is 1.2 TB and I am running this on a cluster with 3 TB
of memory. There appears to be enough memory, but the program freezes after
some time at the step where it maps the RDD to the case class Play. (If I
don't use the Spark SQL portion, i.e. don't map to the case class, register
the table, etc., and merely load the data (the first three lines of the code
below), then the program completes.)
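
For context, Play is a simple case class with four String fields, roughly
like this (only "id", the column the query below selects, is a real field
name; the others are placeholders):

     // Placeholder sketch: "id" is the column queried below,
     // the other field names stand in for the real ones
     case class Play(id: String, f1: String, f2: String, f3: String)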

I tried spark.storage.memoryFraction=0.5 as well as the 0.6 default, as
suggested in the Tuning guide, but that did not help.
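
For reference, here is roughly how I set it (the app name is a placeholder;
the same setting can also be passed with --conf on spark-submit):

     import org.apache.spark.{SparkConf, SparkContext}

     // Lower the fraction of executor heap reserved for cached RDDs
     // from the 0.6 default to 0.5
     val conf = new SparkConf()
       .setAppName("unique-users")   // placeholder name
       .set("spark.storage.memoryFraction", "0.5")
     val sc = new SparkContext(conf)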
According to the logs, the total number of partitions/tasks is 38688, and
each RDD partition for the mapping to the case class is around 31 MB, so the
total RDD size is 38688 * 31 MB, i.e. roughly 1.2 TB. This is less than the
3 TB of memory on the cluster. At the time the program stops, the number of
completed tasks is a little under 38688, with some of them appearing as
failed. There are no details on why the tasks failed.
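
(In case it helps, this is how I read those numbers off; the partition count
can be confirmed directly from the `data` RDD in the code below, and the
~31 MB per partition comes from the logs:)

     // Confirm the partition/task count the logs report
     println("Partitions: %d".format(data.partitions.length))  // 38688

     // Back-of-the-envelope total size:
     //   38688 partitions * 31 MB/partition = 1,199,328 MB ~ 1.2 TB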

Are there any maximum data size constraints in Spark SQL or in table
creation that might be causing the program to hang? Are there any
performance optimizations I could try with Spark SQL that might allow the
task to complete?


     // Needed for the persist() calls below
     import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER
     // sql_cxt is a SQLContext; this import supplies the implicit
     // RDD[Play] -> SchemaRDD conversion so registerTempTable works
     import sql_cxt.createSchemaRDD

     // Load the raw data and split each tab-delimited line
     val data = sc.textFile("shared_dir/*.dat")
                  .map(_.split("\t"))
                  .persist(MEMORY_AND_DISK_SER)

     // Map each row to the Play case class; this is the step where
     // the program freezes
     val play = data.map(f => Play(f(0).trim, f(1).trim, f(2).trim, f(3).trim))
                    .persist(MEMORY_AND_DISK_SER)

     // Register the RDD as a table
     play.registerTempTable("play")

     val ids = sql_cxt.sql("SELECT DISTINCT id FROM play")

     println("Number of unique account IDs = %d".format(ids.count()))
     println("Number of records = %d".format(play.count()))

Thanks