[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056601#comment-16056601 ]

Charles Allen commented on SPARK-19111:

[~ste...@apache.org] It makes it unstable because the history server chokes on such large files; the task itself finishes, though. We didn't debug what it is about the large files that makes the history server choke, but if I recall correctly the history files were on the order of tens of GB, so it wouldn't shock me if the history server didn't handle them efficiently.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
> Issue Type: Bug
> Components: EC2, Mesos, Spark Core
> Affects Versions: 2.0.0
> Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running Spark on Mesos, some large jobs fail to upload to the history server storage!
> A successful sequence of events in the log that yields an upload is as follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key
> {code}
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056516#comment-16056516 ]

Steve Loughran commented on SPARK-19111:

Follow-up: [~drcrallen], Hadoop 2.8 is out the door with the JARs you need for the S3a block output. Curious why it's blocking the history server, though. SPARK-11373 adds metrics to the SHS, which may help.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848403#comment-15848403 ]

Steve Loughran commented on SPARK-19111:

Charles, you might also want to keep an eye on HADOOP-14028. Until that fix is in, s3a will leak disk space for the life of the process; here, a 50 GB file.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840419#comment-15840419 ]

Charles Allen commented on SPARK-19111:

While switching to s3a helped the logs upload, it made the Spark history server unusable, which is probably another bug.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840418#comment-15840418 ]

Charles Allen commented on SPARK-19111:

We have a patch https://github.com/apache/spark/pull/16714 for SPARK-16333, which fixes the problem on our side by disabling the verbose new metrics.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832473#comment-15832473 ]

Jisoo Kim commented on SPARK-19111:

Related to https://issues.apache.org/jira/browse/SPARK-19316.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832458#comment-15832458 ]

Jisoo Kim commented on SPARK-19111:

Thanks [~ste...@apache.org] for the information; using S3a helped with uploading the event logs (I used your PR https://github.com/apache/spark/pull/12004 to use S3a). It turned out that the event log was dramatically larger (~30x) than it used to be with Spark 1.5.2.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821773#comment-15821773 ] Steve Loughran commented on SPARK-19111: --- Just realised one more thing. If the allocated threads for writing data are all used up in PUT calls, the thread calling write() will be blocked. And the tuning I mentioned was "keep the number of publisher threads down to minimise heap use". This means: if the log creation rate exceeds the upload bandwidth for some period of time, whichever thread is converting history events into write() calls will block. Hadoop 2.8 s3a reduces the risk here, as it buffers to HDD by default, but even there, if there's a sustained mismatch between the data generation and upload rates, you'll be in trouble.
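Steve's backpressure point can be illustrated with a toy model (plain Python, not Hadoop code; the slot count and timings are invented for illustration): once every upload slot is busy with a slow PUT, the thread producing history events blocks inside its own write call.

```python
import queue
import threading
import time

# Hypothetical model of a bounded pool of "upload slots". When the queue
# is full (all slots occupied by slow PUTs), the producer blocks in put(),
# just as the thread converting history events to write() calls would.
def run(events, slots, upload_time, write_time):
    q = queue.Queue(maxsize=slots)  # bounded: full queue => writer blocks
    def uploader():
        while True:
            item = q.get()
            if item is None:
                return
            time.sleep(upload_time)  # simulate one slow PUT
            q.task_done()
    threading.Thread(target=uploader, daemon=True).start()
    blocked = 0.0
    for e in range(events):
        time.sleep(write_time)       # event-generation rate
        start = time.monotonic()
        q.put(e)                     # blocks once the pool is saturated
        blocked += time.monotonic() - start
    q.join()
    q.put(None)                      # shut the uploader down
    return blocked

# Generation faster than upload: the writer spends most of its time blocked.
slow = run(events=20, slots=2, upload_time=0.01, write_time=0.001)
# Upload faster than generation: the queue never fills, almost no blocking.
fast = run(events=20, slots=2, upload_time=0.001, write_time=0.01)
```

The model only shows the shape of the problem: no amount of buffering helps if the sustained generation rate exceeds the sustained upload rate.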
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812304#comment-15812304 ] Charles Allen commented on SPARK-19111: --- That's great information, thank you [~ste...@apache.org]
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812301#comment-15812301 ] Charles Allen commented on SPARK-19111: --- I was also going to have the folks here look at the closing sequence, to see why the Spark executor lifecycle wasn't waiting for the close to complete, or at least reporting the error. https://issues.apache.org/jira/browse/SPARK-12330 was previously filed as an "exits too fast" bug.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812286#comment-15812286 ] Charles Allen commented on SPARK-19111: --- If you prefer "re-open on more data" then I can definitely accommodate that.
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15811767#comment-15811767 ] Steve Loughran commented on SPARK-19111: --- What's happening here is that S3n doesn't upload anything until the final close(), making close() an O(data) operation. Maybe that close() call is happening in something catching/swallowing exceptions, or the close() is simply taking so long to upload that some other thread terminates the process first. Two recommendations:
# Please switch to S3a; it's the client that's being maintained. S3n is deliberately being left alone, to avoid breaking things in different ways than today.
# Move to the Hadoop 2.7.x JARs, where S3a is ready to play with.
S3A on Hadoop 2.7 has the ability to do incremental writes of blocks of data, the Fast Output Stream. This removes the size limits and, because it uploads incrementally, the only actions left in close() are: upload the remaining data (guaranteed to be < one block), wait for all outstanding block uploads to complete, and then PUT the final commit-multipart document. It will need careful tuning to avoid running out of heap, as it buffers in RAM, and the defaults are a bit overoptimistic about upload bandwidth. YMMV. Hadoop 2.8 adds HADOOP-13560, the block output stream successor to the fast output stream, which can buffer via /tmp or off-heap buffers, scales nicely, and is tested to many, many GB. I'd grab that as soon as you can. Irrespective of which upload option is used, you aren't going to see any history file until close() completes: that's how object stores work. To close then: use S3a and Hadoop 2.7.x; the fast output stream may make this problem go away through incremental/background writes of the history.
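The tuning Steve describes is driven by Hadoop configuration properties, which Spark passes through to the filesystem client via the spark.hadoop. prefix. A sketch of a spark-defaults.conf along those lines (the bucket name and sizes are placeholders; check the property names against the s3a documentation for your exact Hadoop version):

```properties
# Write event logs through the maintained s3a client, not s3n
spark.eventLog.enabled                         true
spark.eventLog.dir                             s3a://my-bucket/eventLogs
spark.history.fs.logDirectory                  s3a://my-bucket/eventLogs

# Hadoop 2.7.x: enable the incremental "fast" output stream (buffers in RAM)
spark.hadoop.fs.s3a.fast.upload                true
spark.hadoop.fs.s3a.multipart.size             33554432

# Hadoop 2.8.x: buffer blocks on local disk instead of the heap (HADOOP-13560),
# and cap the blocks queued per stream to bound memory/disk use
spark.hadoop.fs.s3a.fast.upload.buffer         disk
spark.hadoop.fs.s3a.fast.upload.active.blocks  4
```

Even with these settings, the history file is only visible once close() completes; the incremental upload just keeps close() short by doing most of the work in the background.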
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15811469#comment-15811469 ] Sean Owen commented on SPARK-19111: --- In practice, there's almost no chance there will be a follow-up on this if not from you. And I note that tickets can be reopened when situations change. I therefore tend to favor not keeping items like this open. It's hard to feel strongly about it, as there are thousands of dead JIRAs here anyway.
[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15809837#comment-15809837 ] Charles Allen commented on SPARK-19111: --- To clarify: I agree this ticket is not actionable as is, but I have not seen this effect recorded in another ticket (is it?), so I propose leaving it open until more information is available, either from our side or from others who report a similar problem.
[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807603#comment-15807603 ] Charles Allen commented on SPARK-19111: --- I have not been able to finish the root-cause investigation, but the upload works for every job except our largest Spark job, and it fails consistently for that job.
[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807377#comment-15807377 ] Sean Owen commented on SPARK-19111: --- Is there any more detail on why? Are you sure it's a size issue, and not just a problem with the network, etc.? I'm not sure it's actionable like this.