Hi, I am trying to load a 1.6 MB Excel file that has 16 tabs. We converted the Excel file to CSV and loaded the 16 CSV files into 8 tables. The PySpark job ran successfully the first time, but when I run the same job a second time, the container gets killed due to memory issues.
I call unpersist() on all RDDs and DataFrames and clearCache() after each file is loaded into its table. The CSV files are loaded sequentially in a for loop, because some of the files go into the same table. The job takes about 15 minutes when it succeeds and 12 to 15 minutes before it fails. If I increase the driver memory and executor memory to more than 5 GB, it succeeds. My assumption is that the driver memory fills up and that unpersist/clearCache are not actually releasing it.

Error: 2 GB of physical memory used and 4.6 GB of virtual memory used (container killed).

Environment: Spark 1.6 running on Cloudera Enterprise. Please let me know if you need any more info. Thanks.
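For context, here is a simplified sketch of the load loop. The file paths, table names, and read/write options are placeholders, not the real ones, and it assumes the databricks spark-csv package is used for reading CSV in Spark 1.6; the actual job differs in the details.

```
# Simplified sketch of the sequential load loop; paths and table names
# are placeholders, not the real ones.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="excel_csv_load")
sqlContext = HiveContext(sc)

# Placeholder mapping of CSV files to target tables
# (several files go into the same table).
csv_to_table = [
    ("/data/tab01.csv", "stage.table_a"),
    ("/data/tab02.csv", "stage.table_a"),
    ("/data/tab03.csv", "stage.table_b"),
    # ... remaining files ...
]

for path, table in csv_to_table:
    # Spark 1.6 reads CSV through the spark-csv package
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load(path))
    df.write.mode("append").saveAsTable(table)
    # Cleanup after each file, as described above
    df.unpersist()
    sqlContext.clearCache()
```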