[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504638#comment-14504638
 ] 

Apache Spark commented on SPARK-6738:
-

User 'shenh062326' has created a pull request for this issue:
https://github.com/apache/spark/pull/5608

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM's heap usage is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
 {code}
 The estimated size differs hugely from the spill file size; there is a bug in 
 SizeEstimator.visitArray.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482987#comment-14482987
 ] 

Sean Owen commented on SPARK-6738:
--

Is that the only file spilled, though? I'm not an expert, but it looks like lots 
of files are spilled here.

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills about 1100 MB of data to disk:
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far)
 But the file size is only 1.1M.
 [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 -rw-r- 1 spark users 1.1M Apr  7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 The estimated size differs hugely from the spill file size.






[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483011#comment-14483011
 ] 

Hong Shen commented on SPARK-6738:
--

Yes, it spills lots of files, but each one is only 1.1M. 

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills about 1100 MB of data to disk:
 {code}
 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far)
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far)
 /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far)
 {code}
 But the file size is only 1.1M.
 {code}
 [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 -rw-r- 1 spark users 1.1M Apr  7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
 {code}
 Here are the other spilled files.
 {code}
 [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/*
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0
 -rw-r- 1 spark users 1.1M Apr  7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24:
 total 1.1M
 -rw-r- 1 spark users 1.1M Apr  7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73
 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e:
 total 2.2M
 -rw-r- 1 spark users 1.1M Apr  7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43
 -rw-r- 1 spark users 1.1M Apr  7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51
 {code}
 The estimated size differs hugely from the spill file size.






[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483025#comment-14483025
 ] 

Sean Owen commented on SPARK-6738:
--

Do you observe a problem? Is it possible that you are looking at unserialized 
objects in memory but a serialized representation on disk? What is the nature of 
the data? More info would be much more helpful.





[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483103#comment-14483103
 ] 

Sean Owen commented on SPARK-6738:
--

To be clear, I am asking how big the data being spilled is in memory. The GC 
state isn't relevant. That is, is the data just compressing 10x on serialization 
into the files you see? That would not be crazy.
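
As a rough check of the 10x hypothesis against the figures quoted in the issue, the sketch below divides the logged in-memory estimate by the spill file size. It assumes the rounded values in the log and the `ll -h` output are exact and that both use binary units; the helper name `shrink_ratio` is made up for illustration.

```python
# Figures quoted in the issue description. Assumption: "MB"/"GB" in the Spark
# log and "M" in the `ll -h` output are binary units, and the rounded values
# are treated as exact.
MB = 1024 ** 2
GB = 1024 ** 3


def shrink_ratio(estimated_bytes, file_bytes):
    """Ratio of the logged in-memory estimate to the spill file size on disk."""
    return estimated_bytes / file_bytes


# Earlier report: "in-memory map of 1106.5 MB" versus a 1.1M spill file.
ratio_first = shrink_ratio(1106.5 * MB, 1.1 * MB)
# Later report: "in-memory map of 2.2 GB" versus a 2.2M spill file.
ratio_second = shrink_ratio(2.2 * GB, 2.2 * MB)

print(f"1106.5 MB estimate vs 1.1M file: about {ratio_first:.0f}x")
print(f"2.2 GB estimate vs 2.2M file: about {ratio_second:.0f}x")
```

Under these assumptions both ratios come out around three orders of magnitude, well above 10x.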







[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-07 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483104#comment-14483104
 ] 

Hong Shen commented on SPARK-6738:
--

I don't think serialization causes the problem. The input data is a Hive 
table, and the Spark job is a Spark SQL query.
In fact, when the log shows an in-memory map of 2.2 GB being spilled to disk, 
the file is only 2.2M, and the GC log shows the JVM's heap usage is less than 
1 GB. So the estimated size also deviates from the JVM's actual memory usage.
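
The comparison with the GC log can be sketched as follows. This is only an illustration: it assumes the GC lines follow the usual HotSpot `before->after(capacity)` layout in kilobytes, and the function name `heap_before_gc_kb` is hypothetical. The regular expression also accepts `-` in place of `->`, since the arrow may be mangled in archived copies of the log.

```python
import re

# GC lines as quoted in the issue description.
GC_LOG = """\
2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
"""

# Groups: heap used before GC, heap used after GC, heap capacity (all in KB).
GC_LINE = re.compile(r"\[GC (\d+)K->?(\d+)K\((\d+)K\)")


def heap_before_gc_kb(log):
    """Return the heap occupancy before each collection, in KB."""
    return [int(m.group(1)) for m in GC_LINE.finditer(log)]


before_kb = heap_before_gc_kb(GC_LOG)
peak_kb = max(before_kb)

# "in-memory map of 2.2 GB" from the spill log line, converted to KB
# (assumption: binary units).
estimate_kb = 2.2 * 1024 * 1024

print(f"peak heap before GC: {peak_kb / 1024:.0f} MB")
print(f"logged map estimate: {estimate_kb / 1024:.0f} MB")
print(f"estimate exceeds peak heap usage: {estimate_kb > peak_kb}")
```

If the format assumption holds, the whole heap never holds more than about 1 GB before a collection, so a single map in it cannot really occupy 2.2 GB.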




