[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483104#comment-14483104 ]
Hong Shen commented on SPARK-6738:
----------------------------------

I don't think serialization is causing the problem. The input data is a Hive table, and the Spark job is a Spark SQL query. In fact, when the log shows "spilling in-memory map of 2.2 GB to disk", the spill file is only 2.2M, and the GC log shows that the JVM is using less than 1 GB. So the estimated size also deviates from the actual JVM memory usage.

> EstimateSize is difference with spill file size
> ------------------------------------------------
>
>                 Key: SPARK-6738
>                 URL: https://issues.apache.org/jira/browse/SPARK-6738
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>
> ExternalAppendOnlyMap spills 2.2 GB of data to disk:
> {code}
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far)
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> But the file size is only 2.2M.
> {code}
> ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
> total 2.2M
> -rw-r----- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> The GC log shows that the JVM memory in use is less than 1 GB.
> {code}
> 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
> 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
> 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
> 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
> 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
> 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
> 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
> {code}
> The estimated size differs hugely from the spill file size.
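
As a rough cross-check of the numbers quoted in the issue, the GC lines can be parsed to see how much heap was actually occupied before and after each collection. This is only a sketch, not part of Spark: it assumes the `[GC before->after(capacity), secs]` layout shown in the report, treats `K` as KiB and "GB" as 1024^3 bytes, and takes the 2.2 GB figure directly from the spill log message. The names `parse_gc_log` and `kb_to_gb` are made up for illustration.

```python
import re

# Matches the simple GC line format seen in the report, e.g.
# "[GC 981981K->55363K(3961344K), 0.0341720 secs]" (assumed layout).
GC_LINE = re.compile(r"\[GC (\d+)K->(\d+)K\((\d+)K\), [\d.]+ secs\]")

GC_LOG = """\
2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
"""

# Size reported by ExternalAppendOnlyMap in the spill log message.
ESTIMATED_GB = 2.2


def parse_gc_log(text):
    """Return a list of (before_kb, after_kb, capacity_kb) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in GC_LINE.finditer(text)]


def kb_to_gb(kb):
    """Convert KiB to GiB (assumes the log's 'K' means 1024 bytes)."""
    return kb / (1024 * 1024)


samples = parse_gc_log(GC_LOG)
peak_before_kb = max(before for before, _, _ in samples)
peak_after_kb = max(after for _, after, _ in samples)

print(f"GC events parsed:        {len(samples)}")
print(f"peak heap before GC:     {kb_to_gb(peak_before_kb):.2f} GB")
print(f"peak heap after GC:      {kb_to_gb(peak_after_kb):.2f} GB")
print(f"estimated map size:      {ESTIMATED_GB:.2f} GB")
print(f"estimate / peak heap:    {ESTIMATED_GB / kb_to_gb(peak_before_kb):.1f}x")
```

Under these assumptions the heap never holds as much as 1 GB before a collection, so the 2.2 GB estimate is larger than everything live in the JVM at that time, which matches the observation in the report.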