[jira] [Resolved] (SPARK-12081) Make unified memory management work with small heaps
[ https://issues.apache.org/jira/browse/SPARK-12081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-12081.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Make unified memory management work with small heaps
> ----------------------------------------------------
>
>                 Key: SPARK-12081
>                 URL: https://issues.apache.org/jira/browse/SPARK-12081
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.6.0
>
> By default, Spark driver and executor memory is 1GB. With the recent unified
> memory mode, only 250MB is set aside for non-storage, non-execution purposes
> (spark.memory.fraction is 75%). However, especially in local mode, the driver
> needs at least ~300MB. Some local jobs started to OOM because of this.
>
> Two mutually exclusive proposals:
> (1) First, cut out 300MB, then take 75% of what remains
> (2) Use min(75% of JVM heap size, JVM heap size - 300MB)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
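The two proposals can be compared with a small sketch. This is an illustrative reading of the ticket (sketched in Java; the names and constants are not Spark's actual code): for a 1GB heap, proposal (1) leaves more headroom for non-Spark purposes than proposal (2).

```java
// Sketch of the two proposed formulas; illustrative only, not Spark's code.
class UnifiedMemorySketch {
    static final long RESERVED_BYTES = 300L * 1024 * 1024; // ~300MB for non-Spark use
    static final double FRACTION = 0.75;                   // spark.memory.fraction

    // Proposal (1): reserve 300MB first, then take 75% of what remains.
    static long proposal1(long heapBytes) {
        return (long) ((heapBytes - RESERVED_BYTES) * FRACTION);
    }

    // Proposal (2): min(75% of the heap, heap - 300MB).
    static long proposal2(long heapBytes) {
        return Math.min((long) (heapBytes * FRACTION), heapBytes - RESERVED_BYTES);
    }
}
```

For a 1GB heap, proposal (1) yields ~543MB of Spark memory and proposal (2) yields ~724MB; both guarantee at least 300MB is left over, but (1) is the more conservative formula.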
[jira] [Commented] (SPARK-12081) Make unified memory management work with small heaps
[ https://issues.apache.org/jira/browse/SPARK-12081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035219 ]

Andrew Or commented on SPARK-12081:
-----------------------------------

The patch took approach (1)

> Make unified memory management work with small heaps
> ----------------------------------------------------
>
>                 Key: SPARK-12081
>                 URL: https://issues.apache.org/jira/browse/SPARK-12081
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.6.0
[jira] [Resolved] (SPARK-5106) Make Web UI automatically refresh/update displayed data
[ https://issues.apache.org/jira/browse/SPARK-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-5106.
------------------------------
    Resolution: Won't Fix

> Make Web UI automatically refresh/update displayed data
> --------------------------------------------------------
>
>                 Key: SPARK-5106
>                 URL: https://issues.apache.org/jira/browse/SPARK-5106
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 1.2.0
>            Reporter: Ryan Williams
>
> My (and presumably others') experience monitoring Spark jobs currently
> consists of repeatedly ⌘R'ing various pages of the web UI to view
> ever-fresher data about how many tasks have succeeded / failed, how much
> spillage is happening, etc., which is tedious.
>
> Particularly unfortunate is the "one refresh over the line" problem where,
> just as things are getting interesting, the job itself fails or finishes, and
> after refreshing the page all data disappears.
>
> It would be good if the web UI updated the data it was displaying
> automatically.
>
> One hacky way to achieve this would be to have it automatically refresh the
> page, though this still risks losing everything when the job finishes.
>
> A better long-term solution would be to have the UI poll for (or have pushed
> to it) updates to the data it is displaying.
>
> Either way, some way to toggle this functionality on or off is probably
> warranted as well.
[jira] [Commented] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous
[ https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15034405 ]

Andrew Or commented on SPARK-12062:
-----------------------------------

Great, I've assigned it to you

> Master rebuilding historical SparkUI should be asynchronous
> ------------------------------------------------------------
>
>                 Key: SPARK-12062
>                 URL: https://issues.apache.org/jira/browse/SPARK-12062
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently
> done within the RPC event loop, which runs in only one thread. Thus, in the
> meantime no other applications can register with this master.
[jira] [Updated] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous
[ https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12062:
------------------------------
    Assignee: Bryan Cutler

> Master rebuilding historical SparkUI should be asynchronous
> ------------------------------------------------------------
>
>                 Key: SPARK-12062
>                 URL: https://issues.apache.org/jira/browse/SPARK-12062
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Bryan Cutler
[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups
[ https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15034447 ]

Andrew Or commented on SPARK-8414:
----------------------------------

I'll submit a patch today.

> Ensure ContextCleaner actually triggers clean ups
> -------------------------------------------------
>
>                 Key: SPARK-8414
>                 URL: https://issues.apache.org/jira/browse/SPARK-8414
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>
> Right now it cleans up old references only through natural GCs, which may not
> occur if the driver has infinite RAM. We should do a periodic GC to make sure
> that we actually do clean things up. Something like once per 30 minutes seems
> relatively inexpensive.
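The "periodic GC" idea described above can be sketched as follows. This is an illustrative sketch in Java, not Spark's actual patch; the class and method names are made up. The key points are a daemon thread (so the scheduler never keeps the JVM alive) and a fixed-rate `System.gc()` to drive weak-reference-based cleanup even when the heap never fills up.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: force a GC on a fixed schedule so reference-queue-based
// cleanup runs even on heaps large enough that natural GCs rarely happen.
class PeriodicGC {
    static ScheduledExecutorService start(long periodMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-gc");
            t.setDaemon(true); // must not keep the JVM alive on its own
            return t;
        });
        scheduler.scheduleAtFixedRate(System::gc, periodMinutes, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

With a 30-minute period as the ticket suggests, `PeriodicGC.start(30)` would be called once at driver startup, and the returned handle shut down when the context stops.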
[jira] [Updated] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups
[ https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8414:
-----------------------------
    Target Version/s: 1.6.0

> Ensure ContextCleaner actually triggers clean ups
> -------------------------------------------------
>
>                 Key: SPARK-8414
>                 URL: https://issues.apache.org/jira/browse/SPARK-8414
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
[jira] [Updated] (SPARK-12059) Standalone Master assertion error
[ https://issues.apache.org/jira/browse/SPARK-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12059:
------------------------------
    Description:
{code}
15/11/30 09:55:04 ERROR Inbox: Ignoring error
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
{code}

  was:
{code}
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
{code}

> Standalone Master assertion error
> ---------------------------------
>
>                 Key: SPARK-12059
>                 URL: https://issues.apache.org/jira/browse/SPARK-12059
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.6.0
>            Reporter: Andrew Or
>            Assignee: Saisai Shao
>            Priority: Critical
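The assertion fires because a duplicate RUNNING message arrives for an executor already in RUNNING. One way to frame the defensive fix is to validate transitions instead of asserting. This is an illustrative sketch in Java, not Spark's actual code: treat executor states as ordered and accept only strictly-forward moves, so an out-of-order or duplicate update is dropped rather than crashing the Master's event loop.

```java
// Illustrative sketch, not Spark's code: an ordered state machine in which a
// duplicate RUNNING -> RUNNING message is simply an illegal (ignorable)
// transition instead of a fatal assertion.
enum ExecutorStateSketch {
    LAUNCHING, LOADING, RUNNING, KILLED, FAILED, LOST, EXITED;

    // A transition is legal only if it moves strictly forward in the order above.
    static boolean isLegalTransition(ExecutorStateSketch from, ExecutorStateSketch to) {
        return from.ordinal() < to.ordinal();
    }
}
```

A message handler would check `isLegalTransition` and log-and-ignore illegal updates instead of asserting, which keeps a racy duplicate message from taking down the event loop.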
[jira] [Resolved] (SPARK-12037) Executors use heartbeatReceiverRef to report heartbeats and task metrics that might not be initialized and leads to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-12037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-12037.
-------------------------------
          Resolution: Fixed
            Assignee: Nan Zhu
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> Executors use heartbeatReceiverRef to report heartbeats and task metrics that
> might not be initialized and leads to NullPointerException
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-12037
>                 URL: https://issues.apache.org/jira/browse/SPARK-12037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: The latest sources at revision {{c793d2d}}
>            Reporter: Jacek Laskowski
>            Assignee: Nan Zhu
>             Fix For: 1.6.0
>
> When {{Executor}} starts, it starts the driver heartbeater (using
> {{startDriverHeartbeater()}}), which uses {{heartbeatReceiverRef}}. That
> reference is initialized later, so there is a possibility of a
> NullPointerException (after {{spark.executor.heartbeatInterval}}, or {{10s}}).
>
> {code}
> WARN Executor: Issue communicating with driver in heartbeater
> java.lang.NullPointerException
>         at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:447)
>         at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:467)
>         at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467)
>         at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1717)
>         at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:467)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
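The race is a heartbeat timer firing before the receiver reference is set. One way to close it (an illustrative sketch, not necessarily the approach the merged patch took; the `String` stands in hypothetically for the real RPC endpoint reference) is to skip the heartbeat while the reference is still null instead of dereferencing it:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: tolerate the startup race by skipping heartbeats until
// the receiver reference has been published, instead of throwing an NPE.
class HeartbeaterSketch {
    // Hypothetical stand-in for the real RPC endpoint reference.
    private final AtomicReference<String> receiverRef = new AtomicReference<>();

    void setReceiver(String ref) {
        receiverRef.set(ref);
    }

    // Returns true if a heartbeat was sent, false if skipped because the
    // reference was not initialized yet (previously: NullPointerException).
    boolean reportHeartbeat() {
        String ref = receiverRef.get();
        if (ref == null) {
            return false; // not ready yet; try again on the next tick
        }
        // ... would send the heartbeat through `ref` here ...
        return true;
    }
}
```

The alternative (and arguably cleaner) fix is simply to initialize the reference before the heartbeater thread is started, so the null window never exists.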
[jira] [Resolved] (SPARK-12035) Add more debug information in include_example tag of Jekyll
[ https://issues.apache.org/jira/browse/SPARK-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-12035.
-------------------------------
          Resolution: Fixed
            Assignee: Xusen Yin
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> Add more debug information in include_example tag of Jekyll
> -----------------------------------------------------------
>
>                 Key: SPARK-12035
>                 URL: https://issues.apache.org/jira/browse/SPARK-12035
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>            Priority: Minor
>              Labels: documentation
>             Fix For: 1.6.0
>
> Add more debug information in the include_example tag of Jekyll, so that we
> have more to go on when `jekyll build` fails.
[jira] [Resolved] (SPARK-12007) Network library's RPC layer requires a lot of copying
[ https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-12007.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Network library's RPC layer requires a lot of copying
> ------------------------------------------------------
>
>                 Key: SPARK-12007
>                 URL: https://issues.apache.org/jira/browse/SPARK-12007
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
>             Fix For: 1.6.0
>
> The network library's RPC layer has an external API based on byte arrays,
> instead of ByteBuffer; that requires a lot of copying, since the internals of
> the library use ByteBuffers (or rather Netty's ByteBuf), and lots of external
> clients also use ByteBuffer.
>
> The extra copies could be avoided if the API used ByteBuffer instead.
> To show an extreme case, look at an RPC send via NettyRpcEnv:
> - the message is encoded using JavaSerializer, resulting in a ByteBuffer
> - the ByteBuffer is copied into a byte array of the right size, since its
>   internal array may be larger than the actual data it holds
> - the network library's encoder copies the byte array into a ByteBuf
> - finally, the data is written to the socket
>
> The two intermediate copies could be avoided if the API allowed the original
> ByteBuffer to be sent instead.
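The copy the byte[]-based API forces can be seen in miniature below. This is a generic Java NIO sketch (not Spark's network-library code): wrapping an array in a `ByteBuffer` is zero-copy, while satisfying an API that demands a right-sized `byte[]` requires copying the bytes out, exactly the second bullet above.

```java
import java.nio.ByteBuffer;

// Sketch of why a ByteBuffer-based API avoids copies.
class ZeroCopySketch {
    // Zero-copy: the returned buffer shares the caller's array.
    static ByteBuffer wrap(byte[] payload) {
        return ByteBuffer.wrap(payload);
    }

    // The copy a byte[]-based API forces when the buffer's backing array may
    // be larger than the data it holds.
    static byte[] toExactArray(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() so the caller's position is untouched
        return out;
    }
}
```

An RPC layer accepting `ByteBuffer` directly could pass the wrapped view all the way to the channel, skipping `toExactArray` entirely.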
[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize
[ https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12060:
------------------------------
    Assignee: Shixiong Zhu

> Avoid memory copy in JavaSerializerInstance.serialize
> ------------------------------------------------------
>
>                 Key: SPARK-12060
>                 URL: https://issues.apache.org/jira/browse/SPARK-12060
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to
> get the serialized data. ByteArrayOutputStream.toByteArray has to copy the
> content of the internal array into a new array. However, since the result is
> immediately converted to a ByteBuffer, we can avoid that memory copy.
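The trick that makes this possible: `ByteArrayOutputStream` exposes its internal buffer (`buf`) and length (`count`) as protected fields, so a subclass can hand out a `ByteBuffer` view without copying. A hypothetical sketch of the idea (the class name is illustrative, not necessarily what the patch added):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch: expose the stream's internal array as a zero-copy
// ByteBuffer instead of copying it with toByteArray().
class ByteBufferOutputStreamSketch extends ByteArrayOutputStream {
    ByteBuffer toByteBuffer() {
        // Wraps the live internal array; the view is valid only as long as
        // no further writes to this stream occur.
        return ByteBuffer.wrap(buf, 0, count);
    }
}
```

The caveat in the comment is the real design trade-off: the buffer aliases the stream's internal state, so this is safe only when the stream is discarded (or at least no longer written) after conversion, which is the case in a serialize-then-send path.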
[jira] [Updated] (SPARK-12007) Network library's RPC layer requires a lot of copying
[ https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12007:
------------------------------
    Affects Version/s: 1.6.0
     Target Version/s: 1.6.0

> Network library's RPC layer requires a lot of copying
> ------------------------------------------------------
>
>                 Key: SPARK-12007
>                 URL: https://issues.apache.org/jira/browse/SPARK-12007
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
[jira] [Updated] (SPARK-12007) Network library's RPC layer requires a lot of copying
[ https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12007:
------------------------------
    Assignee: Marcelo Vanzin

> Network library's RPC layer requires a lot of copying
> ------------------------------------------------------
>
>                 Key: SPARK-12007
>                 URL: https://issues.apache.org/jira/browse/SPARK-12007
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
[jira] [Created] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous
Andrew Or created SPARK-12062:
---------------------------------

             Summary: Master rebuilding historical SparkUI should be asynchronous
                 Key: SPARK-12062
                 URL: https://issues.apache.org/jira/browse/SPARK-12062
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.0.0
            Reporter: Andrew Or


When a long-running application finishes, it takes a while (sometimes minutes)
to rebuild the SparkUI. However, in Master.scala this is currently done within
the RPC event loop, which runs in only one thread. Thus, in the meantime no
other applications can register with this master.
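The structural fix is to move the slow rebuild off the single-threaded event loop. An illustrative sketch of the pattern (names are hypothetical, not Master.scala's actual code): the event-loop handler submits the rebuild to a dedicated daemon thread and returns immediately, so registration messages keep being processed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: hand the slow UI rebuild to a dedicated daemon thread
// so the single-threaded RPC event loop stays responsive.
class AsyncRebuildSketch {
    private static final ExecutorService REBUILD_POOL =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "master-rebuild-ui");
            t.setDaemon(true);
            return t;
        });

    // Called from the event loop; returns immediately instead of blocking it
    // for the minutes the rebuild can take.
    static void onApplicationFinished(Runnable rebuildUi) {
        REBUILD_POOL.submit(rebuildUi);
    }
}
```

A single rebuild thread keeps the rebuilds serialized (only one heavy event-log replay at a time) while still freeing the event loop; a small pool would also work if concurrent rebuilds are acceptable.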
[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize
[ https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-12060:
------------------------------
    Target Version/s: 1.6.0
         Component/s: Spark Core

> Avoid memory copy in JavaSerializerInstance.serialize
> ------------------------------------------------------
>
>                 Key: SPARK-12060
>                 URL: https://issues.apache.org/jira/browse/SPARK-12060
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
[jira] [Created] (SPARK-12059) Standalone Master assertion error
Andrew Or created SPARK-12059:
---------------------------------

             Summary: Standalone Master assertion error
                 Key: SPARK-12059
                 URL: https://issues.apache.org/jira/browse/SPARK-12059
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.6.0
            Reporter: Andrew Or
            Assignee: Saisai Shao
            Priority: Critical


{code}
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
{code}
[jira] [Updated] (SPARK-11999) ThreadUtils.newDaemonCachedThreadPool(prefix, maxThreadNumber) has unexpected behavior
[ https://issues.apache.org/jira/browse/SPARK-11999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11999:
------------------------------
    Assignee: Shixiong Zhu

> ThreadUtils.newDaemonCachedThreadPool(prefix, maxThreadNumber) has
> unexpected behavior
> -------------------------------------------------------------------
>
>                 Key: SPARK-11999
>                 URL: https://issues.apache.org/jira/browse/SPARK-11999
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.5.2, 1.6.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>
> Currently, ThreadUtils.newDaemonCachedThreadPool(prefix, maxThreadNumber)
> will throw RejectedExecutionException if there are already `maxThreadNumber`
> busy threads and we submit a new task. It's because `SynchronousQueue` cannot
> cache any task.
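The behavior and one possible fix can both be demonstrated with plain `ThreadPoolExecutor` (this is a generic sketch of the mechanism, not Spark's ThreadUtils code). A bounded pool over a `SynchronousQueue` rejects new work once every thread is busy, because that queue cannot hold even one pending task; making all `maxThreads` core threads over an unbounded queue lets extra tasks wait instead.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the reported bug and a hypothetical fix.
class CachedPoolSketch {
    // Buggy shape: SynchronousQueue has zero capacity, so once maxThreads
    // workers are busy, execute() throws RejectedExecutionException.
    static ThreadPoolExecutor rejectingPool(int maxThreads) {
        return new ThreadPoolExecutor(0, maxThreads, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    // Fix idea: all maxThreads are core threads over an unbounded queue, so
    // extra tasks queue up; core threads may still time out when idle.
    static ThreadPoolExecutor cachingPool(int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(maxThreads, maxThreads,
                60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```

The `allowCoreThreadTimeOut(true)` call preserves the "cached" property of the original design: threads still die after the keep-alive period instead of lingering forever.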
[jira] [Resolved] (SPARK-10864) SparkUI: app name is hidden if window is resized
[ https://issues.apache.org/jira/browse/SPARK-10864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-10864.
-------------------------------
          Resolution: Fixed
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> SparkUI: app name is hidden if window is resized
> ------------------------------------------------
>
>                 Key: SPARK-10864
>                 URL: https://issues.apache.org/jira/browse/SPARK-10864
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Andrew Or
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: Screen Shot 2015-09-28 at 5.44.06 PM.png
>
> See screenshot
[jira] [Resolved] (SPARK-10558) Wrong executor state in standalone master because of wrong state transition
[ https://issues.apache.org/jira/browse/SPARK-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-10558.
-------------------------------
          Resolution: Fixed
            Assignee: Saisai Shao
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> Wrong executor state in standalone master because of wrong state transition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-10558
>                 URL: https://issues.apache.org/jira/browse/SPARK-10558
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>             Fix For: 1.6.0
>
> Because of a concurrency issue in executor state transitions, the executor
> state saved in the standalone Master may end up as {{LOADING}} rather than
> {{RUNNING}}. This happens when the {{RUNNING}} message is delivered earlier
> than {{LOADING}}. We have to guarantee the correct transition order:
> LAUNCHING -> LOADING -> RUNNING.
[jira] [Updated] (SPARK-11880) On Windows spark-env.cmd is not loaded.
[ https://issues.apache.org/jira/browse/SPARK-11880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11880:
------------------------------
    Assignee: tawan

> On Windows spark-env.cmd is not loaded.
> ---------------------------------------
>
>                 Key: SPARK-11880
>                 URL: https://issues.apache.org/jira/browse/SPARK-11880
>             Project: Spark
>          Issue Type: Bug
>          Components: Windows
>         Environment: Windows
>            Reporter: Gaurav Sehgal
>            Assignee: tawan
>            Priority: Trivial
>             Fix For: 1.6.0
>
> On Windows, bin/load-spark-env.cmd tries to load the file from
> %~dp0..\..\conf, but %~dp0 points to bin and conf is only one level up.
[jira] [Resolved] (SPARK-11880) On Windows spark-env.cmd is not loaded.
[ https://issues.apache.org/jira/browse/SPARK-11880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-11880.
-------------------------------
          Resolution: Fixed
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> On Windows spark-env.cmd is not loaded.
> ---------------------------------------
>
>                 Key: SPARK-11880
>                 URL: https://issues.apache.org/jira/browse/SPARK-11880
>             Project: Spark
>          Issue Type: Bug
>          Components: Windows
>         Environment: Windows
>            Reporter: Gaurav Sehgal
>            Priority: Trivial
>             Fix For: 1.6.0
[jira] [Updated] (SPARK-10864) SparkUI: app name is hidden if window is resized
[ https://issues.apache.org/jira/browse/SPARK-10864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-10864:
------------------------------
    Assignee: Alexander Bozarth

> SparkUI: app name is hidden if window is resized
> ------------------------------------------------
>
>                 Key: SPARK-10864
>                 URL: https://issues.apache.org/jira/browse/SPARK-10864
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Andrew Or
>            Assignee: Alexander Bozarth
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: Screen Shot 2015-09-28 at 5.44.06 PM.png
[jira] [Updated] (SPARK-11866) RpcEnv RPC timeouts can lead to errors, leak in transport library.
[ https://issues.apache.org/jira/browse/SPARK-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11866:
------------------------------
    Priority: Major  (was: Minor)

> RpcEnv RPC timeouts can lead to errors, leak in transport library.
> ------------------------------------------------------------------
>
>                 Key: SPARK-11866
>                 URL: https://issues.apache.org/jira/browse/SPARK-11866
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>
> The {{RpcEnv}} code in spark-core has its own timeout handling capabilities,
> which can clash with the transport library's timeout handling in two ways
> when replies to an RPC message are never sent:
> - If the channel has been idle for a while, the transport library will close
>   the channel because it may think it's hung; this could cause other errors,
>   since the {{RpcEnv}}-based code might not expect those channels to be closed.
> - If the reply never arrives and the channel is not idle, there's state kept
>   in the network library that will never be cleaned up. The {{RpcEnv}}-level
>   timeout code should clean up that state, since it's no longer interested in
>   that RPC.
[jira] [Updated] (SPARK-11866) RpcEnv RPC timeouts can lead to errors, leak in transport library.
[ https://issues.apache.org/jira/browse/SPARK-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11866:
------------------------------
    Target Version/s: 1.6.0

> RpcEnv RPC timeouts can lead to errors, leak in transport library.
> ------------------------------------------------------------------
>
>                 Key: SPARK-11866
>                 URL: https://issues.apache.org/jira/browse/SPARK-11866
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Target Version/s: 1.5.3, 1.6.0

> AkkaRpcEnvSuite is prone to port-contention-related flakiness
> -------------------------------------------------------------
>
>                 Key: SPARK-11831
>                 URL: https://issues.apache.org/jira/browse/SPARK-11831
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>            Reporter: Josh Rosen
>            Assignee: Shixiong Zhu
>              Labels: flaky-test
>
> The AkkaRpcEnvSuite tests appear to be prone to port-contention-related flakiness in Jenkins:
> {code}
> Error Message
> Failed to bind to: localhost/127.0.0.1:12362: Service 'test' failed after 16 retries!
>
> Stacktrace
> java.net.BindException: Failed to bind to: localhost/127.0.0.1:12362: Service 'test' failed after 16 retries!
> 	at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> 	at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
> 	at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
> 	at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
> 	at scala.util.Try$.apply(Try.scala:161)
> 	at scala.util.Success.map(Try.scala:206)
> 	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> 	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> 	at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
> 	at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
> 	at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> 	at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> 	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 	at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
> 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/4819/HADOOP_VERSION=1.2.1,label=spark-test/testReport/junit/org.apache.spark.rpc.akka/AkkaRpcEnvSuite/uriOf__ssl/
> We should probably refactor these tests to not depend on a fixed port.
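The refactoring suggested at the end of the report — not depending on a fixed port — is usually done by binding to port 0 so the OS assigns a free ephemeral port, which the test then reads back. A minimal sketch of that idea in Java (the helper name is illustrative, not Spark's API):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPortDemo {
    // Bind to port 0 so the OS picks any free ephemeral port; the test then
    // reads back the port that was actually chosen instead of hard-coding
    // one (e.g. 12362) that another Jenkins job may already hold.
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("OS-assigned port: " + findFreePort());
    }
}
```

Close-then-reuse still leaves a small race window between releasing the probe socket and the service binding it, so the more robust variant is to have the service under test itself listen on port 0.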
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Labels: flaky-test  (was: )
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Assignee: Shixiong Zhu
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Component/s: Tests
[jira] [Resolved] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-11845.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---------------------------------------------------------------
>
>            Key: SPARK-11845
>            URL: https://issues.apache.org/jira/browse/SPARK-11845
>        Project: Spark
>     Issue Type: Test
>     Components: Streaming
>       Reporter: Tathagata Das
>       Assignee: Tathagata Das
>        Fix For: 1.6.0
[jira] [Commented] (SPARK-11843) Isolate staging directory across applications on same YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-11843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014994#comment-15014994 ]

Andrew Or commented on SPARK-11843:
-----------------------------------
Oops, that seems to be the case. I'm closing this.

> Isolate staging directory across applications on same YARN cluster
> ------------------------------------------------------------------
>
>         Key: SPARK-11843
>         URL: https://issues.apache.org/jira/browse/SPARK-11843
>     Project: Spark
>  Issue Type: Bug
>  Components: YARN
>    Reporter: Andrew Or
>    Priority: Minor
>
> If multiple clients share the same YARN cluster and file system, they may end up using the same `.sparkStaging` directory. This may be a problem if their jars have similar names, for instance. Isolating the staging directories would make it easier to enforce isolation for both security and user experience. We can either:
> (1) allow users to configure the directory name
> (2) add an identifier to the directory name, which I prefer
[jira] [Resolved] (SPARK-11843) Isolate staging directory across applications on same YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-11843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-11843.
-------------------------------
    Resolution: Won't Fix
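Although the issue ended up closed as Won't Fix, proposal (2) — adding an identifier to the directory name — can be sketched as follows. The helper name and directory layout below are purely illustrative and are not Spark's API:

```java
import java.util.UUID;

public class StagingDirDemo {
    // Proposal (2) from SPARK-11843, sketched: suffix the shared staging
    // directory with a per-application identifier so that two applications
    // writing to the same file system can never collide.
    static String isolatedStagingDir(String appId) {
        return ".sparkStaging/" + appId + "_" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(isolatedStagingDir("application_1448000000000_0001"));
    }
}
```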
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Fix Version/s: 1.5.3
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Target Version/s: 1.5.3, 1.6.0  (was: 1.6.0)
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014892#comment-15014892 ]

Andrew Or commented on SPARK-11278:
-----------------------------------
Thanks [~nravi], that's very helpful.

> PageRank fails with unified memory manager
> ------------------------------------------
>
>              Key: SPARK-11278
>              URL: https://issues.apache.org/jira/browse/SPARK-11278
>          Project: Spark
>       Issue Type: Bug
>       Components: GraphX, Spark Core
> Affects Versions: 1.5.1
>         Reporter: Nishkam Ravi
>         Assignee: Andrew Or
>         Priority: Critical
>      Attachments: executor_log_legacyModeTrue.html, executor_logs_legacyModeFalse.html
>
> PageRank (6 nodes, 32GB input) runs very slowly and eventually fails with ExecutorLostFailure. Traced it back to the 'unified memory manager' commit from Oct 13th. Took a quick look at the code and couldn't see the problem (the changes look pretty good). cc'ing [~andrewor14] [~vanzin], who may be able to spot the problem quickly. Can be reproduced by running PageRank on a large enough input dataset if needed. Sorry for not being of much help here.
[jira] [Updated] (SPARK-11746) Use cache-aware method 'dependencies' to instead of 'getDependencies'
[ https://issues.apache.org/jira/browse/SPARK-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11746:
------------------------------
    Assignee: SuYan

> Use cache-aware method 'dependencies' to instead of 'getDependencies'
> ----------------------------------------------------------------------
>
>              Key: SPARK-11746
>              URL: https://issues.apache.org/jira/browse/SPARK-11746
>          Project: Spark
>       Issue Type: Improvement
>       Components: Spark Core
> Affects Versions: 1.5.1
>         Reporter: SuYan
>         Assignee: SuYan
>         Priority: Minor
>          Fix For: 1.6.0
>
> Use cache-aware method 'dependencies' to instead of 'getDependencies'
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11831:
------------------------------
    Fix Version/s: 1.6.0
[jira] [Resolved] (SPARK-11828) DAGScheduler source registered too early with MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-11828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-11828.
-------------------------------
          Resolution: Fixed
            Assignee: Marcelo Vanzin
       Fix Version/s: 1.6.0
    Target Version/s: 1.6.0

> DAGScheduler source registered too early with MetricsSystem
> -----------------------------------------------------------
>
>              Key: SPARK-11828
>              URL: https://issues.apache.org/jira/browse/SPARK-11828
>          Project: Spark
>       Issue Type: Bug
>       Components: Spark Core
> Affects Versions: 1.6.0
>         Reporter: Marcelo Vanzin
>         Assignee: Marcelo Vanzin
>         Priority: Minor
>          Fix For: 1.6.0
>
> I see this log message when starting apps on YARN:
> {quote}
> 15/11/18 13:12:56 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
> {quote}
> That's because DAGScheduler registers itself with the metrics system in its constructor, and the DAGScheduler is instantiated before "spark.app.id" is set in the context's SparkConf.
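The general shape of the problem — a source registering before the value its name depends on exists — can be sketched as follows. The class and method names here are invented for illustration and are not Spark's MetricsSystem API:

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredRegistrationDemo {
    // Sketch of one way to avoid the SPARK-11828 symptom: if a component
    // registers its metrics source before the application id is known,
    // queue the registration instead of falling back to a default name.
    private final List<String> pending = new ArrayList<>();
    private final List<String> registered = new ArrayList<>();
    private String appId;  // unset until the cluster manager reports it

    void registerSource(String sourceName) {
        if (appId == null) {
            pending.add(sourceName);                // too early: defer
        } else {
            registered.add(appId + "." + sourceName);
        }
    }

    void setAppId(String id) {
        appId = id;
        for (String name : pending) {
            registered.add(appId + "." + name);     // flush deferred sources
        }
        pending.clear();
    }

    List<String> registeredNames() {
        return registered;
    }

    public static void main(String[] args) {
        DeferredRegistrationDemo metrics = new DeferredRegistrationDemo();
        metrics.registerSource("DAGScheduler");     // constructor-time call
        metrics.setAppId("app-20151118131256");     // app id arrives later
        System.out.println(metrics.registeredNames());
    }
}
```

The actual fix simply moved the registration to after "spark.app.id" is set; the queue above is just one way to express that ordering constraint in code.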
[jira] [Updated] (SPARK-11746) Use checkpoint-aware method 'dependencies' to instead of 'getDependencies'
[ https://issues.apache.org/jira/browse/SPARK-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-11746:
------------------------------
    Summary: Use checkpoint-aware method 'dependencies' to instead of 'getDependencies'  (was: Use cache-aware method 'dependencies' to instead of 'getDependencies')
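The distinction the retitled issue draws — 'dependencies' versus 'getDependencies' — is that the public accessor consults checkpoint state and memoizes its result, while the raw method recomputes the lineage. A simplified model of that contract (not Spark's actual RDD class):

```java
import java.util.Collections;
import java.util.List;

// Simplified model of an RDD's dependency accessors.
abstract class ModelRdd {
    private List<ModelRdd> memoizedDeps;   // cached lineage
    ModelRdd checkpointRdd;                // non-null once checkpointed

    // Raw lineage computation; calling this directly bypasses both the
    // cache and the checkpoint truncation.
    protected abstract List<ModelRdd> getDependencies();

    // Checkpoint-aware accessor: after checkpointing, the whole lineage is
    // replaced by a single dependency on the checkpoint data; otherwise the
    // computed lineage is memoized.
    final List<ModelRdd> dependencies() {
        if (checkpointRdd != null) {
            return Collections.singletonList(checkpointRdd);
        }
        if (memoizedDeps == null) {
            memoizedDeps = getDependencies();
        }
        return memoizedDeps;
    }
}

public class DependenciesDemo extends ModelRdd {
    static int computeCalls = 0;

    @Override
    protected List<ModelRdd> getDependencies() {
        computeCalls++;                     // count raw lineage computations
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        DependenciesDemo rdd = new DependenciesDemo();
        rdd.dependencies();
        rdd.dependencies();                 // served from the memoized copy
        System.out.println("computeCalls = " + computeCalls);
    }
}
```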
[jira] [Commented] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014648#comment-15014648 ]

Andrew Or commented on SPARK-11831:
-----------------------------------
do we need to backport this into 1.5?
[jira] [Resolved] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-11831.
-------------------------------
    Resolution: Fixed
[jira] [Updated] (SPARK-11831) AkkaRpcEnvSuite is prone to port-contention-related flakiness
[ https://issues.apache.org/jira/browse/SPARK-11831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11831: -- Target Version/s: 1.6.0 (was: 1.5.3, 1.6.0) > AkkaRpcEnvSuite is prone to port-contention-related flakiness > - > > Key: SPARK-11831 > URL: https://issues.apache.org/jira/browse/SPARK-11831 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Josh Rosen >Assignee: Shixiong Zhu > Labels: flaky-test > Fix For: 1.6.0 > > > The AkkaRpcEnvSuite tests appear to be prone to port-contention-related > flakiness in Jenkins: > {code} > Error Message > Failed to bind to: localhost/127.0.0.1:12362: Service 'test' failed after 16 > retries! > Stacktrace > java.net.BindException: Failed to bind to: localhost/127.0.0.1:12362: > Service 'test' failed after 16 retries! > at > org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) > at > akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393) > at > akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389) > at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Success.map(Try.scala:206) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) > at 
akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/4819/HADOOP_VERSION=1.2.1,label=spark-test/testReport/junit/org.apache.spark.rpc.akka/AkkaRpcEnvSuite/uriOf__ssl/ > We should probably refactor these tests to not depend on a fixed port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11799) Make it explicit in executor logs that uncaught exceptions are thrown during executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11799: -- Assignee: Srinivasa Reddy Vundela > Make it explicit in executor logs that uncaught exceptions are thrown during > executor shutdown > -- > > Key: SPARK-11799 > URL: https://issues.apache.org/jira/browse/SPARK-11799 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Assignee: Srinivasa Reddy Vundela >Priority: Minor > > Here is some background for the issue. > Customer got OOM exception in one of the task and executor got killed with > kill %p. Few shutdown hooks are registered with ShutDownHookManager to do the > hadoop temp directory cleanup. During this shutdown phase other tasks are > throwing uncaught exception and executor logs are filled up with so many of > them. > Since it is unclear for the customer in driver logs/ Spark UI why the > container was lost customer is going through the executor logs and he see lot > of uncaught exception. > It would be clear to the customer if we can prepend the uncaught exceptions > with some message like [Container is in shutdown mode] so that he can skip > those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11799) Make it explicit in executor logs that uncaught exceptions are thrown during executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-11799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11799. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Make it explicit in executor logs that uncaught exceptions are thrown during > executor shutdown > -- > > Key: SPARK-11799 > URL: https://issues.apache.org/jira/browse/SPARK-11799 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Assignee: Srinivasa Reddy Vundela >Priority: Minor > Fix For: 1.6.0 > > > Here is some background for the issue. > Customer got OOM exception in one of the task and executor got killed with > kill %p. Few shutdown hooks are registered with ShutDownHookManager to do the > hadoop temp directory cleanup. During this shutdown phase other tasks are > throwing uncaught exception and executor logs are filled up with so many of > them. > Since it is unclear for the customer in driver logs/ Spark UI why the > container was lost customer is going through the executor logs and he see lot > of uncaught exception. > It would be clear to the customer if we can prepend the uncaught exceptions > with some message like [Container is in shutdown mode] so that he can skip > those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
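The requested improvement — prefixing uncaught exceptions logged during shutdown — can be sketched with a shutdown flag consulted by the default uncaught-exception handler. This is an illustration only: the flag and message text are hypothetical, and Spark's real code would consult its ShutdownHookManager rather than a bare AtomicBoolean.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownAwareHandler implements Thread.UncaughtExceptionHandler {
    // Hypothetical stand-in for asking Spark's ShutdownHookManager
    // whether a shutdown is in progress.
    static final AtomicBoolean inShutdown = new AtomicBoolean(false);

    // Prepend a marker so readers of the executor log can skip
    // exceptions that only happened because the container was dying.
    static String format(String threadName, Throwable e) {
        String prefix = inShutdown.get() ? "[Container in shutdown mode] " : "";
        return prefix + "Uncaught exception in thread " + threadName + ": " + e.getMessage();
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        System.err.println(format(t.getName(), e));
    }

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> inShutdown.set(true)));
        Thread.setDefaultUncaughtExceptionHandler(new ShutdownAwareHandler());
        System.out.println(format("Executor task launch worker-1",
                new RuntimeException("temp dir already deleted")));
    }
}
```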
[jira] [Resolved] (SPARK-11746) Use cache-aware method 'dependencies' instead of 'getDependencies'
[ https://issues.apache.org/jira/browse/SPARK-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11746. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Use cache-aware method 'dependencies' instead of 'getDependencies' > -- > > Key: SPARK-11746 > URL: https://issues.apache.org/jira/browse/SPARK-11746 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: SuYan >Assignee: SuYan >Priority: Minor > Fix For: 1.6.0 > > > Use cache-aware method 'dependencies' instead of 'getDependencies' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
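The distinction behind SPARK-11746 is memoization: `RDD.getDependencies` recomputes its result on every call, while `RDD.dependencies` computes once and serves a cached value. A minimal sketch of that pattern (class and dependency names are made up for illustration):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedDeps {
    static final AtomicInteger computeCalls = new AtomicInteger();
    private List<String> deps; // cached after the first computation

    // Analogue of RDD.getDependencies: does the (possibly expensive)
    // work on every call.
    private List<String> getDependencies() {
        computeCalls.incrementAndGet();
        return List.of("parent-rdd-1", "parent-rdd-2");
    }

    // Analogue of RDD.dependencies: computes once, then serves the cache.
    public List<String> dependencies() {
        if (deps == null) {
            deps = getDependencies();
        }
        return deps;
    }

    public static void main(String[] args) {
        CachedDeps rdd = new CachedDeps();
        rdd.dependencies();
        rdd.dependencies();
        System.out.println("computations: " + computeCalls.get());
    }
}
```

Callers that previously invoked the recomputing getter directly switch to the cached accessor, which is the whole content of the patch.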
[jira] [Updated] (SPARK-4134) Dynamic allocation: tone down scary executor lost messages when killing on purpose
[ https://issues.apache.org/jira/browse/SPARK-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4134: - Assignee: Marcelo Vanzin (was: Andrew Or) > Dynamic allocation: tone down scary executor lost messages when killing on > purpose > -- > > Key: SPARK-4134 > URL: https://issues.apache.org/jira/browse/SPARK-4134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Marcelo Vanzin > > After SPARK-3822 goes in, we are now able to dynamically kill executors after > an application has started. However, when we do that we get a ton of scary > error messages telling us that we've done wrong somehow. It would be good to > detect when this is the case and prevent these messages from surfacing. > This maybe difficult, however, because the connection manager tends to be > quite verbose in unconditionally logging disconnection messages. This is a > very nice-to-have for 1.2 but certainly not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11809) Switch the default Mesos mode to coarse-grained mode
[ https://issues.apache.org/jira/browse/SPARK-11809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11809: -- Component/s: (was: SQL) Mesos > Switch the default Mesos mode to coarse-grained mode > > > Key: SPARK-11809 > URL: https://issues.apache.org/jira/browse/SPARK-11809 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > Based on my conversions with people, I believe the consensus is that the > coarse-grained mode is more stable and easier to reason about. It is best to > use that as the default rather than the more flaky fine-grained mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7471) DAG visualization: show call site information
[ https://issues.apache.org/jira/browse/SPARK-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7471. Resolution: Duplicate Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > DAG visualization: show call site information > - > > Key: SPARK-7471 > URL: https://issues.apache.org/jira/browse/SPARK-7471 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.0 > > > It would be useful to find the line that created the RDD / scope. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11700: -- Assignee: Shixiong Zhu > Memory leak at SparkContext jobProgressListener stageIdToData map > - > > Key: SPARK-11700 > URL: https://issues.apache.org/jira/browse/SPARK-11700 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51 Apache tomcat > 8.0.28. Spring 4 >Reporter: Kostas papageorgopoulos >Assignee: Shixiong Zhu >Priority: Critical > Labels: leak, memory-leak > Attachments: AbstractSparkJobRunner.java, > SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, > SparkMemoryAfterLotsOfConsecutiveRuns.png, > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png > > > it seems that there is A SparkContext jobProgressListener memory leak.*. > Bellow i describe the steps i do to reproduce that. > I have created a java webapp trying to abstractly Run some Spark Sql jobs > that read data from HDFS (join them) and Write them To ElasticSearch using ES > hadoop connector. After a Lot of consecutive runs i noticed that my heap > space was full so i got an out of heap space error. > At the attached file {code} AbstractSparkJobRunner {code} the {code} public > final void run(T jobConfiguration, ExecutionLog executionLog) throws > Exception {code} runs each time an Spark Sql Job is triggered. So tried to > reuse the same SparkContext for a number of consecutive runs. If some rules > apply i try to clean up the SparkContext by first calling {code} > killSparkAndSqlContext {code}. This code eventually runs {code} synchronized > (sparkContextThreadLock) { > if (javaSparkContext != null) { > LOGGER.info("!!! CLEARING SPARK > CONTEXT!!!"); > javaSparkContext.stop(); > javaSparkContext = null; > sqlContext = null; > System.gc(); > } > numberOfRunningJobsForSparkContext.getAndSet(0); > } > {code}. 
> So at some point in time i suppose that if no other SparkSql job should run i > should kill the sparkContext (The > AbstractSparkJobRunner.killSparkAndSqlContext runs) and this should be > garbage collected from garbage collector. However this is not the case, Even > if in my debugger shows that my JavaSparkContext object is null see attached > picture {code} SparkContextPossibleMemoryLeakIDEA_DEBUG.png {code}. > The jvisual vm shows an incremental heap space even when the garbage > collector is called. See attached picture {code} SparkHeapSpaceProgress.png > {code}. > The memory analyser Tool shows that a big part of the retained heap to be > assigned to _jobProgressListener see attached picture {code} > SparkMemoryAfterLotsOfConsecutiveRuns.png {code} and summary picture {code} > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png {code}. Although at > the same time in Singleton Service the JavaSparkContext is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11843) Isolate staging directory across applications on same YARN cluster
Andrew Or created SPARK-11843: - Summary: Isolate staging directory across applications on same YARN cluster Key: SPARK-11843 URL: https://issues.apache.org/jira/browse/SPARK-11843 Project: Spark Issue Type: Bug Components: YARN Reporter: Andrew Or Priority: Minor If multiple clients share the same YARN cluster and file system they may end up using the same `.sparkStaging` directory. This may be a problem if their jars are called something similar, for instance. It would be easier to enforce isolation for both security and user experience if the staging directories are isolated. We can either: (1) allow users to configure the directory name (2) add an identifier to the directory name, which I prefer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
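Proposal (2) — adding an identifier to the staging directory name — could look like the sketch below, where the application id plus a random suffix keeps two clients on the same filesystem from colliding. The path layout and helper name are assumptions, not the eventual patch:

```java
import java.util.UUID;

public class StagingDirExample {
    // Build a per-application staging path such as
    // .sparkStaging/application_1448000000000_0001-<uuid>
    // so applications sharing one YARN cluster and filesystem
    // never write into each other's staging directory.
    static String stagingDirFor(String appId) {
        return ".sparkStaging/" + appId + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(stagingDirFor("application_1448000000000_0001"));
    }
}
```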
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012638#comment-15012638 ] Andrew Or commented on SPARK-11278: --- also, when you said 6 nodes what kind of nodes are they? How much memory / cores per node? > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012601#comment-15012601 ] Andrew Or commented on SPARK-11649: --- I back ported it into 1.5. > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11649: -- Target Version/s: 1.5.3, 1.6.0 (was: 1.6.0) > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11649: -- Fix Version/s: 1.5.3 > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012598#comment-15012598 ] Andrew Or commented on SPARK-11649: --- oh I didn't realize, does the new RPC system have the same problem in master though? > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012632#comment-15012632 ] Andrew Or commented on SPARK-11278: --- [~nravi] can you try again with the latest 1.6 branch to see if this is still an issue? I wonder how this is different with https://github.com/apache/spark/commit/56419cf11f769c80f391b45dc41b3c7101cc5ff4. > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-11278: - Assignee: Andrew Or > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Assignee: Andrew Or >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10985: -- Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-1) > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Reporter: Andrew Or >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
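The refactoring the ticket describes replaces a value threaded back through every return type with a thread-local lookup: instead of each layer returning `ArrayBuffer[(BlockId, BlockStatus)]`, the code that drops a block reports it straight to `TaskContext.get`. A minimal Java sketch of that shape (the class and strings are illustrative, not Spark's API):

```java
import java.util.ArrayList;
import java.util.List;

public class TaskContextExample {
    // Thread-local stand-in for Spark's TaskContext.get().
    private static final ThreadLocal<TaskContextExample> CURRENT =
            ThreadLocal.withInitial(TaskContextExample::new);

    final List<String> droppedBlocks = new ArrayList<>();

    static TaskContextExample get() {
        return CURRENT.get();
    }

    // Instead of returning the evicted blocks up the call chain so the
    // caller can attach them to TaskMetrics, the block store reports
    // evictions straight to the current task's context.
    static void putBlock(String blockId) {
        // ...store the block; pretend an old block was evicted to make room...
        get().droppedBlocks.add("evicted-to-fit-" + blockId);
    }

    public static void main(String[] args) {
        putBlock("rdd_0_1");
        System.out.println(get().droppedBlocks);
    }
}
```

This is why the signatures simplify: intermediate layers no longer need to mention dropped blocks at all.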
[jira] [Updated] (SPARK-11309) Clean up hacky use of MemoryManager inside of HashedRelation
[ https://issues.apache.org/jira/browse/SPARK-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11309: -- Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-1) > Clean up hacky use of MemoryManager inside of HashedRelation > > > Key: SPARK-11309 > URL: https://issues.apache.org/jira/browse/SPARK-11309 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen > > In HashedRelation, there's a hacky creation of a new MemoryManager in order > to handle broadcasting of BytesToBytesMap: > https://github.com/apache/spark/blob/85e654c5ec87e666a8845bfd77185c1ea57b268a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L323 > Something similar to this has existed for a while, but the code recently > became much messier as an indirect consequence of my memory manager > consolidation patch. We should see about cleaning this up and removing the > hack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10985: -- Issue Type: Improvement (was: Bug) > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Andrew Or >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager
[ https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10985: -- Target Version/s: (was: 1.6.0) > Avoid passing evicted blocks throughout BlockManager / CacheManager > --- > > Key: SPARK-10985 > URL: https://issues.apache.org/jira/browse/SPARK-10985 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Andrew Or >Priority: Minor > > This is a minor refactoring task. > Currently when we attempt to put a block in, we get back an array buffer of > blocks that are dropped in the process. We do this to propagate these blocks > back to our TaskContext, which will add them to its TaskMetrics so we can see > them in the SparkUI storage tab properly. > Now that we have TaskContext.get, we can just use that to propagate this > information. This simplifies a lot of the signatures and gets rid of weird > return types like the following everywhere: > {code} > ArrayBuffer[(BlockId, BlockStatus)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11700: -- Priority: Critical (was: Minor) > Memory leak at SparkContext jobProgressListener stageIdToData map > - > > Key: SPARK-11700 > URL: https://issues.apache.org/jira/browse/SPARK-11700 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51 Apache tomcat > 8.0.28. Spring 4 >Reporter: Kostas papageorgopoulos >Priority: Critical > Labels: leak, memory-leak > Attachments: AbstractSparkJobRunner.java, > SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, > SparkMemoryAfterLotsOfConsecutiveRuns.png, > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png > > > it seems that there is A SparkContext jobProgressListener memory leak.*. > Bellow i describe the steps i do to reproduce that. > I have created a java webapp trying to abstractly Run some Spark Sql jobs > that read data from HDFS (join them) and Write them To ElasticSearch using ES > hadoop connector. After a Lot of consecutive runs i noticed that my heap > space was full so i got an out of heap space error. > At the attached file {code} AbstractSparkJobRunner {code} the {code} public > final void run(T jobConfiguration, ExecutionLog executionLog) throws > Exception {code} runs each time an Spark Sql Job is triggered. So tried to > reuse the same SparkContext for a number of consecutive runs. If some rules > apply i try to clean up the SparkContext by first calling {code} > killSparkAndSqlContext {code}. This code eventually runs {code} synchronized > (sparkContextThreadLock) { > if (javaSparkContext != null) { > LOGGER.info("!!! CLEARING SPARK > CONTEXT!!!"); > javaSparkContext.stop(); > javaSparkContext = null; > sqlContext = null; > System.gc(); > } > numberOfRunningJobsForSparkContext.getAndSet(0); > } > {code}. 
> So at some point in time i suppose that if no other SparkSql job should run i > should kill the sparkContext (The > AbstractSparkJobRunner.killSparkAndSqlContext runs) and this should be > garbage collected from garbage collector. However this is not the case, Even > if in my debugger shows that my JavaSparkContext object is null see attached > picture {code} SparkContextPossibleMemoryLeakIDEA_DEBUG.png {code}. > The jvisual vm shows an incremental heap space even when the garbage > collector is called. See attached picture {code} SparkHeapSpaceProgress.png > {code}. > The memory analyser Tool shows that a big part of the retained heap to be > assigned to _jobProgressListener see attached picture {code} > SparkMemoryAfterLotsOfConsecutiveRuns.png {code} and summary picture {code} > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png {code}. Although at > the same time in Singleton Service the JavaSparkContext is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11700: -- Target Version/s: 1.6.0 > Memory leak at SparkContext jobProgressListener stageIdToData map > -
[jira] [Updated] (SPARK-11309) Clean up hacky use of MemoryManager inside of HashedRelation
[ https://issues.apache.org/jira/browse/SPARK-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11309: -- Issue Type: Improvement (was: Bug) > Clean up hacky use of MemoryManager inside of HashedRelation > > > Key: SPARK-11309 > URL: https://issues.apache.org/jira/browse/SPARK-11309 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > > In HashedRelation, there's a hacky creation of a new MemoryManager in order > to handle broadcasting of BytesToBytesMap: > https://github.com/apache/spark/blob/85e654c5ec87e666a8845bfd77185c1ea57b268a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L323 > Something similar to this has existed for a while, but the code recently > became much messier as an indirect consequence of my memory manager > consolidation patch. We should see about cleaning this up and removing the > hack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10930) History "Stages" page "duration" can be confusing
[ https://issues.apache.org/jira/browse/SPARK-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10930. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > History "Stages" page "duration" can be confusing > - > > Key: SPARK-10930 > URL: https://issues.apache.org/jira/browse/SPARK-10930 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Derek Dagit > Fix For: 1.6.0 > > > The spark history server, "stages" page shows each stage submitted time and > the duration. The duration can be confusing since the time it actually > starts tasks might be much later then its submitted if its waiting on > previous stages. This makes it hard to figure out which stages were really > slow without clicking into each stage. > It would be nice to perhaps have a first task launched time or processing > time spent in each stage to easily be able to find the slow stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7628) DAG visualization: position graphs with semantic awareness
[ https://issues.apache.org/jira/browse/SPARK-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7628. Resolution: Won't Fix > DAG visualization: position graphs with semantic awareness > -- > > Key: SPARK-7628 > URL: https://issues.apache.org/jira/browse/SPARK-7628 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Many streaming operations aggregate over many batches. The current layout > puts the aggregation stage at the end, resulting in many overlapping edges > that together form a piece of beautiful artwork but nevertheless clutter the > intended visualization. > One thing we could do is to put any stage that has N incoming edges on the > next line rather than piling it up vertically on the right. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7348) DAG visualization: add links to RDD page
[ https://issues.apache.org/jira/browse/SPARK-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7348: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-7463) > DAG visualization: add links to RDD page > > > Key: SPARK-7348 > URL: https://issues.apache.org/jira/browse/SPARK-7348 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It currently has links from the job page to the stage page. It would be nice > if it has links to the corresponding RDD page as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7463) DAG visualization improvements
[ https://issues.apache.org/jira/browse/SPARK-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-7463. -- Resolution: Fixed Fix Version/s: 1.6.0 > DAG visualization improvements > -- > > Key: SPARK-7463 > URL: https://issues.apache.org/jira/browse/SPARK-7463 > Project: Spark > Issue Type: Umbrella > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.0 > > > This is the umbrella JIRA for improvements or bug fixes to the DAG > visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11649. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7349) DAG visualization: add legend to explain the content
[ https://issues.apache.org/jira/browse/SPARK-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7349. Resolution: Won't Fix > DAG visualization: add legend to explain the content > > > Key: SPARK-7349 > URL: https://issues.apache.org/jira/browse/SPARK-7349 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Right now we have red dots and black dots here and there. It's not clear what > they mean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7465) DAG visualization: RDD dependencies not always shown
[ https://issues.apache.org/jira/browse/SPARK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7465. Resolution: Won't Fix > DAG visualization: RDD dependencies not always shown > > > Key: SPARK-7465 > URL: https://issues.apache.org/jira/browse/SPARK-7465 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Currently if the same RDD appears in multiple stages, the arrow will be drawn > only for the first occurrence. It may be too much to show the dependency on > every single occurrence of the same RDD (common in MLlib and GraphX), but we > should at least show them on hover so the user knows where the RDDs are > coming from. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8716) Write tests for executor shared cache feature
[ https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8716: - Target Version/s: (was: 1.6.0) > Write tests for executor shared cache feature > - > > Key: SPARK-8716 > URL: https://issues.apache.org/jira/browse/SPARK-8716 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 1.2.0 >Reporter: Andrew Or > > More specifically, this is the feature that is currently flagged by > `spark.files.useFetchCache`. > This is a complicated feature that has no tests. I cannot say with confidence > that it actually works on all cluster managers. In particular, I believe it > doesn't work on Mesos because whatever goes into this else case creates its > own temp directory per executor: > https://github.com/apache/spark/blob/881662e9c93893430756320f51cef0fc6643f681/core/src/main/scala/org/apache/spark/util/Utils.scala#L739. > It's also not immediately clear that it works on standalone mode due to the > lack of comments. It actually does work there because the Worker happens to > set a `SPARK_EXECUTOR_DIRS` variable. The linkage could be more explicitly > documented in the code. > This is difficult to write tests for, but it's still important to do so. > Otherwise, semi-related changes in the future may easily break it without > anyone noticing. > Related issues: SPARK-8130, SPARK-6313, SPARK-2713 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9552) Dynamic allocation kills busy executors on race condition
[ https://issues.apache.org/jira/browse/SPARK-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9552. -- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Dynamic allocation kills busy executors on race condition > - > > Key: SPARK-9552 > URL: https://issues.apache.org/jira/browse/SPARK-9552 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 1.4.0, 1.4.1 > Reporter: Jie Huang > Assignee: Jie Huang > Fix For: 1.6.0 > > > With dynamic allocation, busy executors are sometimes killed by mistake: executors that still have task assignments get killed for having been idle long enough (say 60 seconds). The root cause is that the task-launch listener event is asynchronous. > For example, tasks may be assigned to an executor that has not yet sent out the listener notification when the executor's dynamic allocation idle timeout (e.g., 60 seconds) expires and triggers a killExecutor event; the timer expires before the listener event arrives. The task then tries to run on top of the killed (or dying) executor, and finally fails. > The proposal to fix it: add a force flag to killExecutor. If force is not set (i.e., false), first check whether the executor being killed is idle or busy; if it has assignments, do not kill it and return false (to indicate the kill failed). Dynamic allocation turns force killing off (force = false), so an attempt to kill a busy executor fails and the idle timer remains valid. Later, when the task-assignment event arrives, the idle timer can be removed accordingly. This avoids false kills of busy executors under dynamic allocation.
> For all other usages, end users can decide whether to use force killing; with that option on, killExecutor performs the action without any status checking.
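The proposed check can be sketched in a few lines; the class and field names below are illustrative, not Spark's actual scheduler API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the proposed force flag on killExecutor: a
// non-forced kill of a busy executor is refused, so the caller (dynamic
// allocation) sees a failure and its idle timer stays valid.
class ExecutorAllocationSketch {
    final Map<String, Integer> runningTasks = new ConcurrentHashMap<>();

    boolean killExecutor(String executorId, boolean force) {
        if (!force && runningTasks.getOrDefault(executorId, 0) > 0) {
            return false; // busy executor: non-forced kill fails
        }
        runningTasks.remove(executorId);
        return true;      // idle executor, or forced kill
    }
}
```

Dynamic allocation would call killExecutor(id, false), while callers that explicitly want an executor gone could pass force = true and bypass the status check.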
[jira] [Resolved] (SPARK-11790) Flaky test: KafkaStreamTests.test_kafka_direct_stream_foreach_get_offsetRanges
[ https://issues.apache.org/jira/browse/SPARK-11790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11790. --- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Flaky test: > KafkaStreamTests.test_kafka_direct_stream_foreach_get_offsetRanges > --- > > Key: SPARK-11790 > URL: https://issues.apache.org/jira/browse/SPARK-11790 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming, Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Labels: flaky-test > Fix For: 1.6.0 > > > Jenkins link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46041/consoleFull > {code} > == > ERROR: test_kafka_direct_stream_foreach_get_offsetRanges > (__main__.KafkaStreamTests) > Test the Python direct Kafka stream foreachRDD get offsetRanges. > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", > line 876, in setUp > self._kafkaTestUtils.setup() > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 308, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o11914.setup. 
> : org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to > zookeeper server within timeout: 6000 > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880) > at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98) > at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84) > at > org.apache.spark.streaming.kafka.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:99) > at > org.apache.spark.streaming.kafka.KafkaTestUtils.setup(KafkaTestUtils.scala:122) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Resolved] (SPARK-11726) Legacy Netty-RPC based submission in standalone mode does not work
[ https://issues.apache.org/jira/browse/SPARK-11726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11726. --- Resolution: Fixed Assignee: Jacek Laskowski Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Legacy Netty-RPC based submission in standalone mode does not work > -- > > Key: SPARK-11726 > URL: https://issues.apache.org/jira/browse/SPARK-11726 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit > Reporter: Jacek Lewandowski > Assignee: Jacek Laskowski > Fix For: 1.6.0 > > > When an application is submitted in cluster mode with the standalone Spark scheduler, either the legacy RPC-based protocol or the REST-based protocol can be used. Spark submit first tries REST and, if that fails, falls back to RPC. > With Akka-based RPC, the REST connection fails immediately because Akka rejects non-Akka connections. With Netty-based RPC, however, the REST client seems to wait for the response indefinitely, making it impossible to fail and fall back to RPC. > The fix is quite simple: set a timeout on reading the response from the server.
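The kind of fix described, bounding the wait for a response, can be sketched with a plain HttpURLConnection; the endpoint and timeout values below are illustrative, not what Spark's REST client actually uses:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: give the REST client connect/read timeouts so that talking to a
// non-REST endpoint (e.g. a Netty RPC port that never answers HTTP) raises
// SocketTimeoutException quickly instead of blocking forever, letting the
// caller fail over to the legacy RPC protocol.
public class RestTimeoutSketch {
    static HttpURLConnection openWithTimeouts(String endpoint) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(10_000); // illustrative: 10 s to establish
        conn.setReadTimeout(10_000);    // illustrative: 10 s to read a response
        return conn;
    }
}
```

Note that openConnection() does not contact the server; the timeouts take effect once the connection is actually used, which is exactly where the indefinite wait occurred.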
[jira] [Resolved] (SPARK-11771) Maximum memory is determined by two params but error message only lists one.
[ https://issues.apache.org/jira/browse/SPARK-11771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11771. --- Resolution: Fixed Assignee: holdenk Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Maximum memory is determined by two params but error message only lists one. > > > Key: SPARK-11771 > URL: https://issues.apache.org/jira/browse/SPARK-11771 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > Fix For: 1.6.0 > > > When we exceed the max memory tell users to increase both params instead of > just the one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11726) Legacy Netty-RPC based submission in standalone mode does not work
[ https://issues.apache.org/jira/browse/SPARK-11726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010288#comment-15010288 ] Andrew Or commented on SPARK-11726: --- oops, fixed > Legacy Netty-RPC based submission in standalone mode does not work > --
[jira] [Updated] (SPARK-11726) Legacy Netty-RPC based submission in standalone mode does not work
[ https://issues.apache.org/jira/browse/SPARK-11726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11726: -- Assignee: Jacek Lewandowski (was: Jacek Laskowski) > Legacy Netty-RPC based submission in standalone mode does not work > --
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Target Version/s: 1.6.0 > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Assignee: Tim Hunter > MiMa excludes miss private classes > --
[jira] [Resolved] (SPARK-11480) Wrong callsite is displayed when using AsyncRDDActions#takeAsync
[ https://issues.apache.org/jira/browse/SPARK-11480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11480. --- Resolution: Fixed Assignee: Kousuke Saruta Fix Version/s: 1.6.0 > Wrong callsite is displayed when using AsyncRDDActions#takeAsync > > > Key: SPARK-11480 > URL: https://issues.apache.org/jira/browse/SPARK-11480 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.6.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.6.0 > > > When we call AsyncRDDActions#takeAsync, actually another DAGScheduler#runJob > is called from another thread so we cannot get proper callsite infomation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11710) Document new memory management model
[ https://issues.apache.org/jira/browse/SPARK-11710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11710. --- Resolution: Fixed Fix Version/s: 1.6.0 > Document new memory management model > > > Key: SPARK-11710 > URL: https://issues.apache.org/jira/browse/SPARK-11710 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.0 > > > e.g. tuning guide still references old deprecated configs > https://spark.apache.org/docs/1.5.0/tuning.html#garbage-collection-tuning -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8029: - Target Version/s: 1.5.3, 1.6.0 (was: 1.5.2, 1.6.0) > ShuffleMapTasks must be robust to concurrent attempts on the same executor > -- > > Key: SPARK-8029 > URL: https://issues.apache.org/jira/browse/SPARK-8029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Imran Rashid >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3, 1.6.0 > > Attachments: > AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf > > > When stages get retried, a task may have more than one attempt running at the > same time, on the same executor. Currently this causes problems for > ShuffleMapTasks, since all attempts try to write to the same output files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?
[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7308: - Assignee: Davies Liu > Should there be multiple concurrent attempts for one stage? > --- > > Key: SPARK-7308 > URL: https://issues.apache.org/jira/browse/SPARK-7308 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.3.1 > Reporter: Imran Rashid > Assignee: Davies Liu > Fix For: 1.5.3, 1.6.0 > > Attachments: SPARK-7308_discussion.pdf > > > Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage. Is this intended? At best, it leads to some very confusing behavior, and it makes it hard for the user to make sense of what is going on. At worst, I think this is the cause of some very strange errors we've seen from users, where stages start executing before all the dependent stages have completed. > This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried. Attempt 1 starts. But tasks from attempt 0 are still running, and some of them can also hit fetch failures after attempt 1 starts. That will cause additional stage attempts to get fired up. > There is an attempt to handle this already: https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 but that only checks whether the **stage** is running. It really should check whether that **attempt** is still running, but there isn't enough info to do that. > I'll also post some info on how to reproduce this.
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8029: - Description: When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files. This is finally resolved through https://github.com/apache/spark/pull/9610, which uses the first writer wins approach. was: When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files. This is resolved through > ShuffleMapTasks must be robust to concurrent attempts on the same executor > -- > > Key: SPARK-8029 > URL: https://issues.apache.org/jira/browse/SPARK-8029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Imran Rashid >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3, 1.6.0 > > Attachments: > AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf > > > When stages get retried, a task may have more than one attempt running at the > same time, on the same executor. Currently this causes problems for > ShuffleMapTasks, since all attempts try to write to the same output files. > This is finally resolved through https://github.com/apache/spark/pull/9610, > which uses the first writer wins approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
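The "first writer wins" rule mentioned in the updated description can be sketched with an atomic put-if-absent commit step; this is an illustration of the idea under assumed names, not Spark's actual shuffle code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative first-writer-wins sketch: each task attempt produces its own
// temporary output, then tries to commit it for its partition; only the
// first attempt to commit a partition wins, and later attempts learn they
// lost so they can discard their output.
class FirstWriterWins {
    final ConcurrentMap<Integer, String> committed = new ConcurrentHashMap<>();

    // Returns true if this attempt's output was kept.
    boolean commit(int partition, String attemptOutput) {
        return committed.putIfAbsent(partition, attemptOutput) == null;
    }
}
```

Because the commit is a single atomic operation, two concurrent attempts of the same map task on the same executor can no longer both claim the canonical output, which is the failure mode described in the issue.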
[jira] [Resolved] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry
[ https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-7829. -- Resolution: Fixed Assignee: Davies Liu (was: Imran Rashid) Fix Version/s: 1.6.0 1.5.3 Target Version/s: 1.5.3, 1.6.0 > SortShuffleWriter writes inconsistent data & index files on stage retry > --- > > Key: SPARK-7829 > URL: https://issues.apache.org/jira/browse/SPARK-7829 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.3.1 >Reporter: Imran Rashid >Assignee: Davies Liu > Fix For: 1.5.3, 1.6.0 > > > When a stage is retried, a shuffle map task may be retried even if it was > successful the first time. If it happens to get scheduled on the same > executor, the old data file is *appended* to, while the index file still assumes > the data starts in position 0. This leads to an apparently corrupt shuffle > map output, since when the data file is read, the index file points to the > wrong location.
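The data/index inconsistency is easy to reproduce with a toy model of the sort-shuffle layout: a data file holding per-reducer byte blocks back to back, and an index file recording where each block starts. This is an illustrative sketch with an invented text format, not Spark's binary layout:

```python
import os
import tempfile

def write_shuffle_output(data_file, index_file, blocks, append):
    """Toy sort-shuffle writer. The buggy retry path *appends* to the data
    file while rewriting the index as if the data started at offset 0."""
    offsets, pos = [], 0
    with open(data_file, "ab" if append else "wb") as d:
        for block in blocks:
            offsets.append(pos)  # bug: offsets ignore pre-existing bytes
            d.write(block)
            pos += len(block)
    with open(index_file, "w") as i:
        i.write(",".join(map(str, offsets)))

def read_block(data_file, index_file, reducer, size):
    with open(index_file) as i:
        offset = int(i.read().split(",")[reducer])
    with open(data_file, "rb") as d:
        d.seek(offset)
        return d.read(size)

tmp = tempfile.mkdtemp()
data, index = os.path.join(tmp, "0.data"), os.path.join(tmp, "0.index")

write_shuffle_output(data, index, [b"aaaa", b"bbbb"], append=False)
assert read_block(data, index, 1, 4) == b"bbbb"   # consistent on first write

# Stage retry reruns the task on the same executor; its output (here made
# distinguishable as uppercase) is appended, but the index restarts at 0.
write_shuffle_output(data, index, [b"AAAA", b"BBBB"], append=True)
assert read_block(data, index, 1, 4) == b"bbbb"   # stale bytes, not b"BBBB"
```

The reader follows the index to the old attempt's bytes, which is exactly the "apparently corrupt shuffle map output" the issue describes.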
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8029: - Description: When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files. This is resolved through was:When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files. > ShuffleMapTasks must be robust to concurrent attempts on the same executor > -- > > Key: SPARK-8029 > URL: https://issues.apache.org/jira/browse/SPARK-8029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Imran Rashid >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3, 1.6.0 > > Attachments: > AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf > > > When stages get retried, a task may have more than one attempt running at the > same time, on the same executor. Currently this causes problems for > ShuffleMapTasks, since all attempts try to write to the same output files. > This is resolved through -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry
[ https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004780#comment-15004780 ] Andrew Or commented on SPARK-7829: -- I believe this is now fixed by https://github.com/apache/spark/pull/9610. Let me know if this is not the case. > SortShuffleWriter writes inconsistent data & index files on stage retry > --- > > Key: SPARK-7829 > URL: https://issues.apache.org/jira/browse/SPARK-7829 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.3.1 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 1.5.3, 1.6.0 > > > When a stage is retried, a shuffle map task may be retried even if it was > successful the first time. If it happens to get scheduled on the same > executor, the old data file is *appended* to, while the index file still assumes > the data starts in position 0. This leads to an apparently corrupt shuffle > map output, since when the data file is read, the index file points to the > wrong location.
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8029: - Fix Version/s: (was: 1.5.2) 1.5.3 > ShuffleMapTasks must be robust to concurrent attempts on the same executor > -- > > Key: SPARK-8029 > URL: https://issues.apache.org/jira/browse/SPARK-8029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Imran Rashid >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3, 1.6.0 > > Attachments: > AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf > > > When stages get retried, a task may have more than one attempt running at the > same time, on the same executor. Currently this causes problems for > ShuffleMapTasks, since all attempts try to write to the same output files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8029: - Fix Version/s: 1.5.2 > ShuffleMapTasks must be robust to concurrent attempts on the same executor > -- > > Key: SPARK-8029 > URL: https://issues.apache.org/jira/browse/SPARK-8029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Imran Rashid >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.2, 1.6.0 > > Attachments: > AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf > > > When stages get retried, a task may have more than one attempt running at the > same time, on the same executor. Currently this causes problems for > ShuffleMapTasks, since all attempts try to write to the same output files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004874#comment-15004874 ] Andrew Or commented on SPARK-8582: -- Hi everyone, I have bumped this to 1.7.0 because of the potential performance regressions a fix could introduce. If you are affected by this and would like to solve it earlier, you can work around it by calling `persist` before you call `checkpoint`. This ensures that the second computation of the RDD reads from the cache instead, which is much faster for many workloads. > Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Shixiong Zhu > > In Spark, checkpointing allows the user to truncate the lineage of an RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented very efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
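Why the `persist`-before-`checkpoint` workaround helps can be shown with a toy model that counts partition computations. This is not PySpark's API, just a minimal sketch of the caching effect:

```python
computed = {"count": 0}

def expensive_partition(i):
    computed["count"] += 1  # tally every real computation of a partition
    return i * i

class TinyRDD:
    """Toy RDD: without persist, both the triggering action and the
    checkpoint write recompute every partition; with persist, the
    checkpoint pass reads the cache. (Invented class, not PySpark.)"""
    def __init__(self, n):
        self.n, self.cache = n, None
    def persist(self):
        self.cache = {}
        return self
    def _part(self, i):
        if self.cache is None:
            return expensive_partition(i)
        if i not in self.cache:
            self.cache[i] = expensive_partition(i)
        return self.cache[i]
    def collect(self):            # the action that triggers checkpointing
        return [self._part(i) for i in range((self.n))]
    def checkpoint_write(self):   # the second pass that writes partitions out
        return [self._part(i) for i in range(self.n)]

rdd = TinyRDD(4)
rdd.collect(); rdd.checkpoint_write()
assert computed["count"] == 8     # every partition computed twice

computed["count"] = 0
rdd2 = TinyRDD(4).persist()       # the workaround: persist first
rdd2.collect(); rdd2.checkpoint_write()
assert computed["count"] == 4     # the cache spares the second pass
```

In real PySpark the same pattern would be `rdd.persist()` followed by `rdd.checkpoint()` before the first action, at the cost of holding the data in cache.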
[jira] [Updated] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8582: - Target Version/s: 1.7.0 (was: 1.6.0) > Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Shixiong Zhu > > In Spark, checkpointing allows the user to truncate the lineage of an RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented very efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
[jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?
[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004783#comment-15004783 ] Andrew Or commented on SPARK-7308: -- Should this still be open given that all associated JIRAs are closed? I think we've already established that there's no bullet-proof way to do this on the scheduler side, so we need to make the write side robust. > Should there be multiple concurrent attempts for one stage? > --- > > Key: SPARK-7308 > URL: https://issues.apache.org/jira/browse/SPARK-7308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Imran Rashid >Assignee: Imran Rashid > Attachments: SPARK-7308_discussion.pdf > > > Currently, when there is a fetch failure, you can end up with multiple > concurrent attempts for the same stage. Is this intended? At best, it leads > to some very confusing behavior, and it makes it hard for the user to make > sense of what is going on. At worst, I think this is the cause of some very > strange errors we've seen from users, where stages start > executing before all the dependent stages have completed. > This can happen in the following scenario: there is a fetch failure in > attempt 0, so the stage is retried. Attempt 1 starts. But, tasks from > attempt 0 are still running -- some of them can also hit fetch failures after > attempt 1 starts. That will cause additional stage attempts to get fired up. > There is an attempt to handle this already > https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 > but that only checks whether the **stage** is running. It really should > check whether that **attempt** is still running, but there isn't enough info > to do that. > I'll also post some info on how to reproduce this.
[jira] [Resolved] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-7970. -- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Optimize code for SQL queries fired on Union of RDDs (closure cleaner) > -- > > Key: SPARK-7970 > URL: https://issues.apache.org/jira/browse/SPARK-7970 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Nitin Goyal >Assignee: Nitin Goyal > Fix For: 1.6.0 > > Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot > 2015-05-27 at 11.07.02 pm.png > > > The closure cleaner slows down the execution of Spark SQL queries fired on a > union of RDDs. The time increases linearly on the driver side with the number > of RDDs unioned. Refer to the following thread for more context: > http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html > As can be seen in the attached JProfiler screenshots, a lot of time is > consumed in the "getClassReader" method of ClosureCleaner and the rest in > "ensureSerializable" (at least in my case) > This can be fixed in two ways (as per my current understanding): > 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create > MapPartitionsRDD directly instead of doing rdd.mapPartitions, which calls > the ClosureCleaner clean method (See PR - > https://github.com/apache/spark/pull/6256). > 2. Fix at the Spark core level - > (i) Make "checkSerializable" property-driven in SparkContext's clean method > (ii) Somehow cache the class reader for the last 'n' classes
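Proposal (ii), caching the class reader for the last n classes, amounts to memoizing an expensive per-class lookup that a union of many RDDs repeats for the same few closure classes. A minimal sketch of that idea, with an invented stand-in for the real bytecode read (not Spark's ClosureCleaner code):

```python
from functools import lru_cache

loads = {"count": 0}

@lru_cache(maxsize=128)  # keep readers for the last n (here 128) classes
def get_class_reader(class_name):
    """Hypothetical stand-in for reading a class's bytecode; the counter
    records how many times the expensive load actually runs."""
    loads["count"] += 1
    return "reader-for-" + class_name

# Cleaning closures for a query over a union of many RDDs touches the same
# few closure classes repeatedly; the cache turns repeats into hits.
for _ in range(100):
    get_class_reader("MyClosure$anonfun$1")
assert loads["count"] == 1  # loaded once, then served from the cache
```

The trade-off is the usual one for memoization: the driver holds up to `maxsize` readers in memory in exchange for skipping repeated loads.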
[jira] [Created] (SPARK-11710) Document new memory management model
Andrew Or created SPARK-11710: - Summary: Document new memory management model Key: SPARK-11710 URL: https://issues.apache.org/jira/browse/SPARK-11710 Project: Spark Issue Type: Sub-task Components: Documentation, Spark Core Affects Versions: 1.6.0 Reporter: Andrew Or Assignee: Andrew Or e.g. tuning guide still references old deprecated configs https://spark.apache.org/docs/1.5.0/tuning.html#garbage-collection-tuning -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11658) simplify documentation for PySpark combineByKey
[ https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11658. --- Resolution: Fixed Assignee: chris snow Fix Version/s: 1.7.0 Target Version/s: 1.7.0 > simplify documentation for PySpark combineByKey > --- > > Key: SPARK-11658 > URL: https://issues.apache.org/jira/browse/SPARK-11658 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 1.5.1 >Reporter: chris snow >Assignee: chris snow >Priority: Minor > Fix For: 1.7.0 > > > The current documentation for combineByKey looks like this: > {code} > >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) > >>> def f(x): return x > >>> def add(a, b): return a + str(b) > >>> sorted(x.combineByKey(str, add, add).collect()) > [('a', '11'), ('b', '1')] > """ > {code} > I think it could be simplified to: > {code} > >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) > >>> def add(a, b): return a + str(b) > >>> x.combineByKey(str, add, add).collect() > [('a', '11'), ('b', '1')] > """ > {code} > I'll shortly add a patch for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
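The semantics the `combineByKey` docstring demonstrates can be modeled in plain Python, which also shows why `str` and `add` produce `[('a', '11'), ('b', '1')]`. This is an illustrative single-partition model, not PySpark's distributed implementation:

```python
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    """Single-partition model of combineByKey: the first value for a key is
    turned into a combiner, later values are merged in. (merge_combiners
    would merge per-partition results; it is unused with one partition.)"""
    combiners = {}
    for k, v in pairs:
        if k not in combiners:
            combiners[k] = create_combiner(v)
        else:
            combiners[k] = merge_value(combiners[k], v)
    return sorted(combiners.items())

def add(a, b):
    return a + str(b)

# "a": str(1) -> "1", then add("1", 1) -> "11"; "b": str(1) -> "1"
result = combine_by_key([("a", 1), ("b", 1), ("a", 1)], str, add, add)
assert result == [("a", "11"), ("b", "1")]
```

This matches the output in the documented example and shows that the extra `def f(x): return x` in the original docstring plays no role, which is the simplification the issue proposes.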
[jira] [Updated] (SPARK-2533) Show summary of locality level of completed tasks in each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2533: - Assignee: Jean-Baptiste Onofré > Show summary of locality level of completed tasks in each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Assignee: Jean-Baptiste Onofré >Priority: Minor > Fix For: 1.6.0 > > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under each locality level (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of the web UI. It would be better to show a summary of task > locality levels in the web UI.
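The proposed summary is essentially a tally of completed tasks by locality level, so users no longer have to scan the full task table. A minimal sketch of that aggregation (the task list here is made up for illustration):

```python
from collections import Counter

# Hypothetical locality levels of a stage's completed tasks, as the stage
# page would collect them from task metrics.
task_localities = ["PROCESS_LOCAL", "PROCESS_LOCAL", "NODE_LOCAL",
                   "RACK_LOCAL", "PROCESS_LOCAL", "NODE_LOCAL"]

summary = Counter(task_localities)   # locality level -> task count
assert summary["PROCESS_LOCAL"] == 3
assert summary["NODE_LOCAL"] == 2
assert summary["RACK_LOCAL"] == 1
```

Rendering such a per-level count at the top of the stage page gives the at-a-glance view the issue asks for, even when the stage has thousands of tasks.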
[jira] [Updated] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
[ https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11671: -- Assignee: chris snow > Example for sqlContext.createDataDrame from pandas.DataFrame has a typo > --- > > Key: SPARK-11671 > URL: https://issues.apache.org/jira/browse/SPARK-11671 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 1.5.1 >Reporter: chris snow >Assignee: chris snow >Priority: Minor > Fix For: 1.7.0 > > > PySpark documentation error: > {code} > sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) > {code} > Results in: > {code} > --- > AttributeErrorTraceback (most recent call last) > in () > > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) > /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc > in __getattr__(self, name) >1841 return self[name] >1842 raise AttributeError("'%s' object has no attribute '%s'" % > -> 1843 (type(self).__name__, name)) >1844 >1845 def __setattr__(self, name, value): > AttributeError: 'DataFrame' object has no attribute 'collect' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
[ https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11671: -- Fix Version/s: (was: 1.7.0) 1.6.0 > Example for sqlContext.createDataDrame from pandas.DataFrame has a typo > --- > > Key: SPARK-11671 > URL: https://issues.apache.org/jira/browse/SPARK-11671 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 1.5.1 >Reporter: chris snow >Assignee: chris snow >Priority: Minor > Fix For: 1.6.0 > > > PySpark documentation error: > {code} > sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) > {code} > Results in: > {code} > --- > AttributeErrorTraceback (most recent call last) > in () > > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) > /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc > in __getattr__(self, name) >1841 return self[name] >1842 raise AttributeError("'%s' object has no attribute '%s'" % > -> 1843 (type(self).__name__, name)) >1844 >1845 def __setattr__(self, name, value): > AttributeError: 'DataFrame' object has no attribute 'collect' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11658) simplify documentation for PySpark combineByKey
[ https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11658: -- Fix Version/s: (was: 1.7.0) 1.6.0 > simplify documentation for PySpark combineByKey > --- > > Key: SPARK-11658 > URL: https://issues.apache.org/jira/browse/SPARK-11658 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 1.5.1 >Reporter: chris snow >Assignee: chris snow >Priority: Minor > Fix For: 1.6.0 > > > The current documentation for combineByKey looks like this: > {code} > >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) > >>> def f(x): return x > >>> def add(a, b): return a + str(b) > >>> sorted(x.combineByKey(str, add, add).collect()) > [('a', '11'), ('b', '1')] > """ > {code} > I think it could be simplified to: > {code} > >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) > >>> def add(a, b): return a + str(b) > >>> x.combineByKey(str, add, add).collect() > [('a', '11'), ('b', '1')] > """ > {code} > I'll shortly add a patch for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org