Hi,

I have the following code, which calls hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing, for every partition of a Hive table. So I first run SHOW PARTITIONS and, looping over its output, call a few methods that create each table if it does not exist and insert into it via hiveContext.sql(). Since hiveContext cannot be used inside executors, I have to run this loop in the driver program, serially, one partition at a time.

When I submit this Spark job on a YARN cluster, my executors are almost always lost with a "shuffle not found" exception. This happens because YARN kills the executors for exceeding their memory limits. I don't understand why: the data set for each Hive partition is quite small, yet it still causes YARN to kill my executors.

Could someone explain why the following code overloads memory? Does it run everything in parallel and try to hold the data for all Hive partitions in memory at the same time? Please guide me, I am blocked on this issue.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.hive.HiveContext;

    import java.io.IOException;

    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf();
        SparkContext sc = new SparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc);
        FileSystem fs = FileSystem.get(new Configuration());

        // List the partitions for the given date; each returned row
        // looks like "server=xyz/date=2015-08-05".
        DataFrame partitionFrame = hiveContext.sql(
            "SHOW PARTITIONS dbdata PARTITION (date='2015-08-05')");
        Row[] rowArr = partitionFrame.collect();

        for (Row row : rowArr) {
            String[] splitArr = row.getString(0).split("/");
            String server = splitArr[0].split("=")[1];
            String date = splitArr[1].split("=")[1];

            String csvPath = "hdfs:///user/db/ext/" + server + ".csv";
            if (fs.exists(new Path(csvPath))) {
                hiveContext.sql("ADD FILE " + csvPath);
            }

            // Create each table if it does not exist and insert this
            // partition's data, one table at a time.
            createInsertIntoTableABC(hiveContext, server, date);
            createInsertIntoTableDEF(hiveContext, server, date);
            createInsertIntoTableGHI(hiveContext, server, date);
            createInsertIntoTableJKL(hiveContext, server, date);
            createInsertIntoTableMNO(hiveContext, server, date);
        }
    }
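For reference, each of those helper methods looks roughly like the sketch below. The table name (abc_summary), its columns, and the aggregation query are made up for illustration; only the pattern of CREATE TABLE IF NOT EXISTS followed by INSERT INTO through hiveContext.sql() matches my real code:

    private static void createInsertIntoTableABC(HiveContext hiveContext,
                                                 String server, String date) {
        // Hypothetical schema, for illustration only. Creating the table
        // is idempotent, so repeated calls in the loop are no-ops.
        hiveContext.sql(
            "CREATE TABLE IF NOT EXISTS abc_summary (" +
            "  server STRING, metric STRING, total BIGINT" +
            ") PARTITIONED BY (date STRING)");

        // Insert the aggregated rows for this server/date partition.
        hiveContext.sql(
            "INSERT INTO TABLE abc_summary PARTITION (date='" + date + "') " +
            "SELECT server, metric, SUM(value) FROM dbdata " +
            "WHERE server='" + server + "' AND date='" + date + "' " +
            "GROUP BY server, metric");
    }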