Hi,

I am reading a Hive ORC table into memory, with the storage level set to
StorageLevel.MEMORY_AND_DISK_SER. The total size of the Hive table is 5 GB.
I started spark-shell as below:

spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=512
I have a 10-node cluster, each node with 35 GB memory and 4 cores, running HDP 2.5.
The SPARK_LOCAL_DIRS location has enough space.

My concern is that the simple code below takes approx. 10-12 minutes to load
the data into memory. If I change the values of
num-executors/driver-memory/executor-memory/executor-cores from those
mentioned above, I get a "No space left on device" error.
While the job is running, each node consumes a varying amount of memory,
from 7 GB to 20 GB.
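
In case concrete flags help, below is a sketch of the kind of variant I could
launch next; the larger overhead value and the local-dir path are placeholders,
not settings I have verified, and my understanding is that under YARN the
executors actually spill to the NodeManager local dirs
(yarn.nodemanager.local-dirs) rather than to SPARK_LOCAL_DIRS:

spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.local.dir=/path/to/larger/disk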

import org.apache.spark.storage.StorageLevel

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Allow the Hive reader to pick up data in nested subdirectories
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

// Read the table, repartition, and cache it serialized in memory/on disk
val tab1 = sqlContext.sql("select * from xyz")
  .repartition(150)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

tab1.registerTempTable("AUDIT")
tab1.count()   // action that materialises the cache
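
For comparison, a minimal variant I could also try (a sketch only, assuming the
same xyz table) that skips the repartition step; my understanding is that the
repartition(150) shuffle writes temporary files to the executor local dirs, so
dropping it might reduce both the load time and the disk pressure:

import org.apache.spark.storage.StorageLevel

// Sketch: cache the table without forcing a 150-way shuffle.
// Assumes the default ORC split count is acceptable for a 5 GB table.
val tab1NoShuffle = sqlContext.sql("select * from xyz")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

tab1NoShuffle.registerTempTable("AUDIT")
tab1NoShuffle.count()   // materialises the cache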

Kindly advise how to improve the performance of loading the Hive table into
Spark memory and how to avoid the space issue.

Regards,
~Sri
