Hi All,
I am getting an OutOfMemoryError (GC overhead limit exceeded) while reading a
Hive table from Spark like this:
spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT
> 10").show()
When I run the above command in spark-shell, it starts processing *1780
tasks* and goes OOM at a specific partition.
1. The table partition (*date='2019-05-14'*) has *4000* files on HDFS, so
ideally 4000 partitions should be created in the Spark DataFrame, if I am
not wrong. When I analyzed the table, it has a total of *1780* Hive
partitions (i.e. 1780 date folders), which matches the 1780 tasks I see.
2. I checked the size of the files in the table partition (*date='2019-05-14'*);
the maximum file size is *1.1 GB*, and I have given *7 GB* to each executor,
so if my reasoning above is right, it should not throw OOM.
3. And when I put *LIMIT 10*, does Spark still read all the files of the
table? (See the snippet below for how I plan to check this.)
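
To double-check points 1 and 3 myself, this is roughly what I plan to run in
spark-shell (using the built-in `spark` session; table name and date as above,
so please read it as a sketch rather than my exact commands):

    // How many input partitions (tasks) does Spark create for this one date partition?
    val scan = spark.table("some.table").where("date = '2019-05-14'")
    println(s"Input partitions: ${scan.rdd.getNumPartitions}")

    // Does the LIMIT appear in the physical plan (CollectLimit / GlobalLimit),
    // and is the partition filter pushed into the scan?
    val limited = spark.sql("SELECT * FROM some.table WHERE date = '2019-05-14' LIMIT 10")
    limited.explain(true)

If anyone can explain what the task count and the plan should look like here,
that would help me understand where the OOM is coming from.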
Thanks
--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Email:- [email protected]
LinkedIn:- https://www.linkedin.com/in/28shivamsharma