cxzl25 commented on pull request #30725:
URL: https://github.com/apache/spark/pull/30725#issuecomment-745305483


   > I see, thanks for the reference. So IIUC this patch is primarily targeting 
the `spark.hadoop.cloneConf = true` use case?
   
   No.
   When `spark.hadoop.cloneConf=false`, `HadoopRDD#getPartitions` creates a JobConf and adds it to the `hadoopJobMetadata` cache.
   When the queried Hive table has a large number of partitions, many JobConf objects are created and added to that cache.
   When the driver memory is configured small, the driver exhausts its heap and ends up in full GC.
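
   For context, here is a minimal sketch of the `cloneConf=false` path, written from the description above rather than copied from the Spark source; the method signature and the explicit `cache` parameter are simplifications (in Spark the cache is `SparkEnv.hadoopJobMetadata`, keyed per RDD):

   ```scala
   import java.util.concurrent.ConcurrentMap
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.mapred.JobConf

   // Simplified: one JobConf is created and cached per HadoopRDD, so a query that
   // scans many Hive partitions (many HadoopRDDs) piles up JobConfs on the driver.
   def getJobConf(conf: Configuration, cacheKey: String,
                  cache: ConcurrentMap[String, AnyRef]): JobConf = {
     conf match {
       case jc: JobConf => jc                             // already a JobConf, reuse it
       case _ if cache.containsKey(cacheKey) =>
         cache.get(cacheKey).asInstanceOf[JobConf]        // cache hit for this RDD
       case _ =>
         val newJobConf = new JobConf(conf)               // new JobConf per RDD
         cache.put(cacheKey, newJobConf)                  // retained in the driver-side cache
         newJobConf
     }
   }
   ```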
   
   If your Hadoop client version is above 2.7, or you apply the patch from [HADOOP-11209](https://issues.apache.org/jira/browse/HADOOP-11209), you can enable `spark.hadoop.cloneConf=true`; then the driver will not accumulate too many JobConf objects.
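
   For completeness, one way to turn it on from a SparkSession-based job (just an illustration; passing `--conf spark.hadoop.cloneConf=true` to `spark-submit` or setting it in `spark-defaults.conf` works the same way):

   ```scala
   import org.apache.spark.sql.SparkSession

   // With cloneConf=true, HadoopRDD clones the JobConf instead of caching one per RDD,
   // avoiding the driver-side buildup described above (requires a Hadoop client with
   // the HADOOP-11209 fix, since Configuration cloning must be thread-safe).
   val spark = SparkSession.builder()
     .appName("cloneConf-example")                // hypothetical app name
     .config("spark.hadoop.cloneConf", "true")
     .getOrCreate()
   ```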
   
   
   

