Hi,

I am trying to load a 1.6 MB Excel file that has 16 tabs. We converted the
Excel file to CSV and loaded the 16 CSV files into 8 tables. The PySpark job
ran successfully on the first run, but when we run the same job a second
time, the container gets killed due to memory issues.

I call unpersist() and clearCache() on all RDDs and DataFrames after each
file is loaded into its table. The CSV files are loaded sequentially in a
for loop, since some of the files go to the same table (a rough sketch of
the loop is below). The job takes about 15 minutes when it succeeds and
12-15 minutes when it fails. If I increase the driver memory and executor
memory to more than 5 GB, it succeeds.
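For reference, here is a rough sketch of the loading loop. The file names,
table names, and CSV read options below are placeholders, not the real
ones; the actual job maps the 16 files onto 8 tables.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="excel_csv_load")   # placeholder app name
    sqlContext = SQLContext(sc)

    # Placeholder mapping of CSV files to target tables; some files share a table.
    csv_to_table = [
        ("tab01.csv", "table_a"),
        ("tab02.csv", "table_a"),
        ("tab03.csv", "table_b"),
    ]

    for path, table in csv_to_table:
        df = (sqlContext.read
              .format("com.databricks.spark.csv")   # spark-csv package on Spark 1.6
              .option("header", "true")
              .load(path))
        df.write.mode("append").saveAsTable(table)  # append, since files can share a table
        df.unpersist()           # release this DataFrame
        sqlContext.clearCache()  # drop any cached data before the next file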

My assumption is that the driver memory is filling up and that
unpersist()/clearCache() are not actually freeing it.

Error: the container reports 2 GB of physical memory used and 4.6 GB of
virtual memory used.

We are running Spark 1.6 on Cloudera Enterprise.

Please let me know if you need any more info.


Thanks
