Hi,

I am reading a Hive ORC table into memory, with the storage level set to StorageLevel.MEMORY_AND_DISK_SER. The total size of the Hive table is 5 GB. I started spark-shell as below:
spark-shell --master yarn --deploy-mode client --num-executors 8 --driver-memory 5G --executor-memory 7G --executor-cores 2 --conf spark.yarn.executor.memoryOverhead=512

I have a 10-node cluster, each node with 35 GB of memory and 4 cores, running HDP 2.5. The SPARK_LOCAL_DIRS location has enough space.

My concerns are:
- The simple code below takes approx. 10-12 minutes just to load the data into memory.
- If I change the values for num-executors / driver-memory / executor-memory / executor-cores from the ones above, I get a "No space left on device" error.
- While the job runs, each node consumes a varying amount of memory, anywhere from 7 GB to 20 GB.

import org.apache.spark.storage.StorageLevel

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

val tab1 = sqlContext.sql("select * from xyz")
  .repartition(150)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

tab1.registerTempTable("AUDIT")
tab1.count()

Kindly advise how to improve the performance of loading the Hive table into Spark memory, and how to avoid the space issue.

Regards,
~Sri
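
P.S. In case it helps frame an answer, below is a minimal sketch of a variant I am considering but have not verified on this cluster. The spark.serializer=KryoSerializer setting and the idea of dropping the repartition(150) step (so the load does not shuffle and spill to local disk before caching) are my own assumptions, not something I know will fix the issue:

// Assumption: launching with Kryo may shrink the serialized cached blocks
// used by the MEMORY_AND_DISK_SER storage level.
// spark-shell --master yarn --deploy-mode client \
//   --num-executors 8 --driver-memory 5G --executor-memory 7G --executor-cores 2 \
//   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

import org.apache.spark.storage.StorageLevel

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

// Persist without the extra repartition(150): the partition count then follows
// the ORC splits, and the load avoids a full shuffle before the cache is built.
val tab1 = sqlContext.sql("select * from xyz")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

tab1.registerTempTable("AUDIT")
tab1.count()   // action that materializes the cache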