[jira] [Updated] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4598: -- Affects Version/s: 1.2.0 Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: meiyoula Priority: Critical In the HistoryServer stage page, clicking the task href in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228677#comment-14228677 ] Josh Rosen commented on SPARK-4598: --- I was able to reproduce this issue using the SparkPi example. I captured a heap dump in YourKit, and it looks like the raw, uncompressed HTML of the stage page is over 75 megabytes, while the Scala XML tree corresponding to the page is hundreds of megabytes (~200). The actual HTML itself should be highly compressible, since it contains a lot of redundancy. In the longer term, we could also explore approaches that perform more of the rendering / formatting in the browser using Javascript; this would allow us to send the task table data as JSON or CSV, which would contain much less redundancy, and we could also avoid the overheads of the XML library. As a shorter-term hack, though, I wonder whether there's some trick to reduce the overall memory usage of the intermediate scala.xml data structures, since it seems odd that we end up materializing such a large object graph when it seems like large portions of it could be lazily streamed. Maybe there's some simple trick where sprinkling in a few {{.iterator}} calls would improve things.
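To illustrate the streaming idea above: a minimal sketch (not Spark code; `TaskRow`, `renderRow`, and the markup are illustrative assumptions) of the difference between materializing a whole page and yielding it lazily, so a servlet could stream chunks to the response writer instead of holding the full page in memory:

```scala
// Hypothetical task row; the real UI renders many more columns.
case class TaskRow(id: Int, status: String, durationMs: Long)

def renderRow(t: TaskRow): String =
  s"<tr><td>${t.id}</td><td>${t.status}</td><td>${t.durationMs}</td></tr>"

// Eager: the entire page is built as one String (or scala.xml tree) in memory.
def renderEager(tasks: Seq[TaskRow]): String =
  tasks.map(renderRow).mkString("<table>", "", "</table>")

// Lazy: an Iterator of chunks; at most one row's HTML is resident at a time.
def renderStreaming(tasks: Seq[TaskRow]): Iterator[String] =
  Iterator("<table>") ++ tasks.iterator.map(renderRow) ++ Iterator("</table>")
```

Both produce the same bytes; the streaming version just never holds them all at once, which is the property the {{.iterator}} trick would be aiming for.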
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228684#comment-14228684 ] Josh Rosen commented on SPARK-4598: --- Actually, it might be pretty hard to trim down the memory usage via scala.xml tricks. Adding some functionality to return the stage table information as CSV data might be a cleaner way to handle this. This doesn't necessarily imply using AJAX requests to load the data from the backend; we could just dump the CSV data into a script tag and load it via Javascript. We might be able to hide all of this complexity behind the StageTableBase class, so we could confine this change to a small section of the code.
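A minimal sketch of the "CSV in a script tag" idea (the helper names, the `text/csv` tag layout, and the element id are illustrative assumptions, not from the Spark codebase): the server emits compact CSV once, and client-side Javascript would parse it and build the table DOM:

```scala
// Quote a CSV field only when needed (comma, quote, or newline present),
// doubling embedded quotes per the usual CSV convention.
def escapeCsv(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

// Flatten rows of fields into one CSV payload.
def toCsv(rows: Seq[Seq[String]]): String =
  rows.map(_.map(escapeCsv).mkString(",")).mkString("\n")

// Embed the payload in a non-executing script tag; a browser ignores
// type="text/csv", but Javascript can read it back by element id.
def embedInScriptTag(csv: String): String =
  "<script type=\"text/csv\" id=\"task-data\">\n" + csv + "\n</script>"
```

The payload carries each value once, with none of the per-cell tag overhead of the rendered HTML, which is where the size win would come from.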
[jira] [Created] (SPARK-4652) Add docs about spark-git-repo option
Kai Sasaki created SPARK-4652: - Summary: Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Minor It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed for it.
[jira] [Commented] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228789#comment-14228789 ] Apache Spark commented on SPARK-4652: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3513
[jira] [Commented] (SPARK-4082) Show Waiting/Queued Stages in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228820#comment-14228820 ] Patrick Wendell commented on SPARK-4082: IMO this is sufficiently addressed by the jobs page. Or at least, now that we have the jobs page, I'd be interested in seeing whether people still feel a big need for pending stages in the stage page. Show Waiting/Queued Stages in Spark UI -- Key: SPARK-4082 URL: https://issues.apache.org/jira/browse/SPARK-4082 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Pat McDonough In the Stages UI page, it would be helpful to show the user any stages the DAGScheduler has planned but that are not yet active. Currently, this info is not shown to the user in any way. /CC [~pwendell]
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228839#comment-14228839 ] Aaron Davidson commented on SPARK-4644: --- [~zsxwing] I believe that this problem is related more fundamentally to the problem that Spark currently requires all values for the same key to remain in memory. Your solution aims to fix this for the specific case of joins, but I wonder whether, if we generalized it, we could solve this for things like groupBy as well. I don't have a fully fleshed-out idea yet, but I was considering a model where there are two types of shuffles: aggregation-based and rearrangement-based. Aggregation-based shuffles use partial aggregation and combiners to form and merge (K, C) pairs. Rearrangement-based shuffles, however, do not expect a decrease in the total amount of data, so that model does not make sense for them. Instead, we could provide an interface similar to ExternalAppendOnlyMap but which returns Iterator[(K, Iterable[V])] pairs, with some extra semantics attached to the Iterable[V]s (such as a .chunkedIterator() method which enables block nested loops join). In this model, join could be implemented by mapping the left side's key to (K, 1) and the right side's to (K, 2), with logic that reads from two adjacent value-iterables simultaneously -- e.g., {code} val ((k, 1), left: Iterable[V]) = map.next() val ((k, 2), right: Iterable[V]) = map.next() // perform merge using the left and right iterators {code} Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve this, we propose a skewed join implementation.
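A minimal in-memory sketch of the tagging scheme Aaron describes above (this is an assumption-laden illustration, not Spark code: a real implementation would spill to disk like ExternalAppendOnlyMap and use chunked iterators rather than materialized groups). Keys become (k, 1) on the left side and (k, 2) on the right, so after sorting, each key's left group is immediately followed by its right group and the two can be merged with a nested loop:

```scala
// Join two sides by tagging keys so adjacent sorted groups can be merged.
def tagAndJoin(left: Seq[(String, Int)],
               right: Seq[(String, Int)]): Seq[(String, (Int, Int))] = {
  // Tag: left values get key (k, 1), right values get key (k, 2).
  val tagged =
    left.map  { case (k, v) => ((k, 1), v) } ++
    right.map { case (k, v) => ((k, 2), v) }
  // Group values per (key, tag) and sort, like a sorted external map would.
  val groups = tagged.groupBy(_._1).toSeq.sortBy(_._1)
    .map { case (kt, kvs) => (kt, kvs.map(_._2)) }
  // Walk adjacent group pairs; a (k, 1) group followed by the matching
  // (k, 2) group yields the cross product of their values.
  groups.sliding(2).collect {
    case Seq(((k1, 1), lvs), ((k2, 2), rvs)) if k1 == k2 =>
      for (l <- lvs; r <- rvs) yield (k1, (l, r))
  }.flatten.toSeq
}
```

The point of the interface sketch in the comment is that only two groups need to be resident at once, and with chunked iterators not even a whole group would be.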
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Description: Support Coalesce function in Spark SQL. Support type widening in Coalesce function. And replace Coalesce UDF in Spark Hive with local Coalesce function since it is memory efficient and faster. was: Support Coalesce function in Spark SQL
[jira] [Updated] (SPARK-4648) Support COALESCE function in Spark SQL and HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Summary: Support COALESCE function in Spark SQL and HiveQL (was: Support Coalesce in Spark SQL.)
[jira] [Created] (SPARK-4653) DAGScheduler refactoring and cleanup
Josh Rosen created SPARK-4653: - Summary: DAGScheduler refactoring and cleanup Key: SPARK-4653 URL: https://issues.apache.org/jira/browse/SPARK-4653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen This is an umbrella JIRA for DAGScheduler refactoring and cleanup.
[jira] [Created] (SPARK-4654) Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods
Josh Rosen created SPARK-4654: - Summary: Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods Key: SPARK-4654 URL: https://issues.apache.org/jira/browse/SPARK-4654 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen DAGScheduler has {{getMissingParentStages()}} and {{stageDependsOn()}} methods, which are suspiciously similar to {{getParentStages()}}. All of these methods perform traversal of the RDD / Stage graph to inspect parent stages. We can remove both of these methods, though: the set of parent stages is known when a {{Stage}} instance is constructed and is already stored in {{Stage.parents}}, so we can just check for missing stages by looking for unavailable stages in {{Stage.parents}}. Similarly, we can determine whether one stage depends on another by searching {{Stage.parents}} rather than performing the entire graph traversal from scratch.
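The proposed simplification can be sketched as follows ({{Stage}} here is an illustrative stand-in, not the real DAGScheduler class): with parents stored on the stage itself, finding missing parents becomes a lookup over {{Stage.parents}}, and dependency checks traverse the stage graph rather than re-walking the RDD graph from scratch:

```scala
// Hypothetical stand-in for the scheduler's Stage, with parents precomputed
// at construction time, as the issue proposes.
case class Stage(id: Int, parents: List[Stage], isAvailable: Boolean)

// "Missing" parents are simply the unavailable ones among Stage.parents.
def getMissingParentStages(stage: Stage): List[Stage] =
  stage.parents.filterNot(_.isAvailable)

// Dependency check: search the parent chain transitively.
def stageDependsOn(stage: Stage, target: Stage): Boolean =
  stage.parents.exists(p => p == target || stageDependsOn(p, target))
```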
[jira] [Created] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
Josh Rosen created SPARK-4655: - Summary: Split Stage into ShuffleMapStage and ResultStage subclasses Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code.
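The shape of the proposed split, as an illustrative sketch (class bodies and field names here are assumptions, not the actual Spark fields): stage-kind-specific state moves out of the shared base class into the subclass that actually uses it:

```scala
// Hypothetical base class: only what every stage kind needs.
sealed abstract class Stage(val id: Int, val parents: List[Stage])

// Shuffle map stages would own shuffle-specific state.
class ShuffleMapStage(id: Int, parents: List[Stage],
                      val numOutputPartitions: Int) extends Stage(id, parents)

// Result stages would own job-result-specific state.
class ResultStage(id: Int, parents: List[Stage]) extends Stage(id, parents)
```

Code that handles both kinds can then pattern match on the subclass instead of checking nullable fields on one monolithic class.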
[jira] [Assigned] (SPARK-4653) DAGScheduler refactoring and cleanup
[ https://issues.apache.org/jira/browse/SPARK-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4653: - Assignee: Josh Rosen
[jira] [Commented] (SPARK-4654) Clean up DAGScheduler's getMissingParentStages() and stageDependsOn() methods
[ https://issues.apache.org/jira/browse/SPARK-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228919#comment-14228919 ] Apache Spark commented on SPARK-4654: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3515
[jira] [Resolved] (SPARK-4622) Add the some error infomation if using spark-sql in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4622. --- Resolution: Duplicate Add the some error infomation if using spark-sql in yarn-cluster mode - Key: SPARK-4622 URL: https://issues.apache.org/jira/browse/SPARK-4622 Project: Spark Issue Type: Bug Components: Deploy Reporter: carlmartin If spark-sql is used in yarn-cluster mode, print an error message, just as the spark shell does in yarn-cluster mode.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228937#comment-14228937 ] Josh Rosen commented on SPARK-4498: --- Hi [~airhorns], I finally got a chance to look into this and, based on reading the code, I have a theory about what might be happening. If you look at the [current Master.scala file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668], you'll notice that there are only two situations where the standalone Master removes applications: - The master receives a DisassociatedEvent due to the application actor shutting down and calls {{finishApplication}}. - An executor exited with a non-zero exit status and the maximum number of executor failures has been exceeded. Now, imagine that for some reason the standalone Master does not receive a DisassociatedEvent. When executors eventually start to die, the standalone master will discover this via ExecutorStateChanged. If it hasn't hit the maximum number of executor failures, [it will attempt to re-schedule the application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325] and obtain new resources. 
If a new executor is granted, this will [cause the failed executor count to reset to zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313], leading to a sort of livelock: executors die because they can't contact the application, but they keep being launched because executors keep entering the ExecutorState.RUNNING state ([it looks like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148] executors transition to this state when they launch, not once they've registered with the driver). It looks like the line {code} if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() } {code} was introduced in SPARK-2425, after the earliest commit that you mentioned, so this appears to be a regression in 1.2.0. I don't think that we should revert SPARK-2425, since that fixes another fairly important bug. Instead, I'd like to figure out how an application could fail without a DisassociatedEvent causing it to be removed. Could this be due to our use of non-standard Akka timeout / failure detector settings? I would think that we'd still get a DisassociatedEvent when a network connection was closed. Maybe we could switch to relying on our own explicit heartbeats for failure detection, like we do elsewhere in Spark. [~markhamstra], do you have any ideas here? 
Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version 1.7.0_71 Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Critical Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, when they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly.
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Priority: Blocker (was: Critical) Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version 1.7.0_71 Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, when they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of Spark configuration options we don't set to the default: {code} spark.cores.max: 30 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs spark.eventLog.enabled: true spark.executor.memory: 7g spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/ spark.io.compression.codec: lzf spark.ui.killEnabled: true {code} - The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421) {code}
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Affects Version/s: 1.1.1
[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4498: -- Target Version/s: 1.2.0, 1.1.2 Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3 java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports applications as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this blip applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part, I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of spark configuration options we don't set to the default: {code} spark.cores.max: 30 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs spark.eventLog.enabled: true spark.executor.memory: 7g spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/ spark.io.compression.codec: lzf spark.ui.killEnabled: true {code} - The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at
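The failure loop in the description shows up in this executor log: the executor starts, tries to reach a driver that no longer exists, and dies after a 30-second timeout. A hedged sketch of the executor-side step that produces the "Futures timed out after [30 seconds]" error; the message and method names here are illustrative, not Spark's exact internals:

```scala
// Hedged sketch: the freshly launched executor asks the (already dead)
// driver for its Spark properties and blocks on the reply. When the driver
// is gone, the Await times out, the executor process dies, and the master
// relaunches it on another worker, repeating the cycle.
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.actor.ActorSelection
import akka.pattern.ask
import akka.util.Timeout

def fetchDriverProperties(driver: ActorSelection): Seq[(String, String)] = {
  implicit val timeout = Timeout(30.seconds)
  val reply = driver ? "RetrieveSparkProps" // illustrative message name
  // Throws java.util.concurrent.TimeoutException when the driver never answers,
  // which is the exception the executor log above shows before it exits.
  Await.result(reply, timeout.duration).asInstanceOf[Seq[(String, String)]]
}
```

Because the master only sees "executor EXITED", not "driver unreachable", it keeps relaunching executors instead of failing the application.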
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228943#comment-14228943 ] Josh Rosen commented on SPARK-4498: --- Adding 1.1.1 as an affected version as well, since SPARK-2425 was backported to that release.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228947#comment-14228947 ] Mark Hamstra commented on SPARK-4498: - On a quick look-through, your analysis looks likely to be correct, [~joshrosen]. Making sure that failed applications are always accompanied by a DisassociatedEvent would be a good thing. The belt-and-suspenders fix would be to also change the executor state-change semantics so that either RUNNING means not just that the executor process is running, but also that it has successfully connected to the application, or else introduce an additional executor state (perhaps REGISTERED) along with state transitions and finer-grained state logic controlling executor restart and application removal.
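The "additional executor state" idea could look roughly like this; a hypothetical enum, not Spark's actual ExecutorState:

```scala
// Sketch: RUNNING would be reserved for executors that have actually
// connected to the application, so the master can tell "process launched"
// apart from "registered with the app".
object ExecutorState extends Enumeration {
  val LAUNCHING, REGISTERED, RUNNING, KILLED, FAILED, LOST, EXITED = Value

  // An executor that exits without ever reaching REGISTERED suggests the
  // application itself is gone; the master could count these toward
  // removing the app instead of relaunching executors indefinitely.
  def exitedBeforeRegistering(lastState: Value): Boolean = lastState == LAUNCHING
}
```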
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228952#comment-14228952 ] Josh Rosen commented on SPARK-4498: --- In addition to exploring the missing DisassociatedEvent theory, it might also be worthwhile to brainstorm whether problems at other steps in the cleanup process could cause an application to fail to be removed. I'm not sure that a single missing DisassociatedEvent could explain the blip behavior observed here, where an entire group of applications fail to be marked as completed / failed. In the DisassociatedEvent handler, we index into {{addressToApp}} to determine which app corresponded to the DisassociatedEvent:
{code}
case DisassociatedEvent(_, address, _) => {
  // The disconnected client could've been either a worker or an app; remove whichever it was
  logInfo(s"$address got disassociated, removing it.")
  addressToWorker.get(address).foreach(removeWorker)
  addressToApp.get(address).foreach(finishApplication)
  if (state == RecoveryState.RECOVERING && canCompleteRecovery) { completeRecovery() }
}
{code}
If the {{addressToApp}} entry was empty / wrong, then we wouldn't properly clean up the app. However, I don't think that there should be any problems here because each application actor system should have its own distinct address and Akka's {{Address}} class properly implements hashCode / equals. Even if drivers run on the same host, their actor systems should have different port numbers. Continuing along:
{code}
def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
{code}
Is there any way that the {{apps}} HashSet could fail to contain {{app}}? I don't think so: {{ApplicationInfo}} doesn't override equals/hashCode, but I don't think that's a problem since we only create one ApplicationInfo per app, so the default object identity comparison should be fine. 
We should probably log an error if we call {{removeApplication}} on an application that has already been removed, though. (Also, why do we need the {{apps}} HashSet when we could just use {{idToApp.values}}?)
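The suggested defensive logging can be sketched on a simplified stand-in for the master's bookkeeping; this is not the real Master.scala, and it is keyed on idToApp alone, per the parenthetical question about dropping the separate apps HashSet:

```scala
import scala.collection.mutable

// Sketch: a removal attempt for an unknown or already-removed app logs an
// error instead of silently doing nothing, so double removals and missed
// registrations become visible in the master log.
class MasterBookkeeping {
  private val idToApp = mutable.HashMap[String, AnyRef]()

  def registerApp(id: String, info: AnyRef): Unit = idToApp(id) = info

  def removeApplication(id: String): Unit = idToApp.remove(id) match {
    case Some(_) => println("Removing app " + id)
    case None    => println(s"ERROR: asked to remove unknown/already-removed app $id")
  }
}
```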
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228966#comment-14228966 ] Josh Rosen commented on SPARK-4498: --- Here's an interesting pattern to grep for in all-master-logs-around-blip.txt: {{sparkdri...@spark-etl1.chi.shopify.com:52047}}. Note that this log is in reverse-chronological order. The earliest occurrence is in a DisassociatedEvent log message:
{code}
2014-11-19_18:48:31.34508 14/11/19 18:48:31 ERROR EndpointWriter: AssociationError [akka.tcp://sparkmas...@dn05.chi.shopify.com:7077] -> [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047]: Error [Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047] [
2014-11-19_18:48:31.34510 akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047
2014-11-19_18:48:31.34511 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
2014-11-19_18:48:31.34512 ]
2014-11-19_18:48:31.34521 14/11/19 18:48:31 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.16.126.88%3A48040-1355#-59270061] was not delivered. [2859] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-11-19_18:48:31.34603 14/11/19 18:48:31 INFO Master: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047 got disassociated, removing it.
2014-11-19_18:48:31.20255 14/11/19 18:48:31 INFO Master: Removing executor app-20141119184815-1316/7 because it is EXITED
{code}
Even though INFO-level logging is enabled, there's no "INFO Master: Removing app ..." message near this event. 
The entire log contains many repetitions of this same DisassociatedEvent log. The same log also contains many executors that launch and immediately fail:
{code}
2014-11-19_18:52:51.84000 14/11/19 18:52:51 INFO Master: Launching executor app-20141119184815-1313/75 on worker worker-20141118172622-dn19.chi.shopify.com-38498
2014-11-19_18:52:51.83981 14/11/19 18:52:51 INFO Master: Removing executor app-20141119184815-1313/67 because it is EXITED
{code}
I couldn't find a {{removing app app-20141119184815-1313}} event. Another interesting thing: even though it looks like this log contains information for 39 drivers, there are 100 disassociated events:
{code}
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | uniq | wc -l
39
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | wc -l
100
{code}
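The two shell pipelines above can be replicated in a few lines of Scala; the regex is an approximation of the grep pattern:

```scala
// Extract the driver port from each "... got disassociated" line, then
// compare total events against distinct drivers (100 vs. 39 in the
// attached log): more disassociations than drivers means some drivers
// disassociated more than once.
object DisassociationStats {
  private val Disassociated = """:(\d{5}) got disassociated""".r

  /** Returns (total disassociation events, distinct driver ports). */
  def count(logLines: Seq[String]): (Int, Int) = {
    val ports = logLines.flatMap(l => Disassociated.findFirstMatchIn(l).map(_.group(1)))
    (ports.size, ports.distinct.size)
  }
}
```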
[jira] [Resolved] (SPARK-4057) Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging
[ https://issues.apache.org/jira/browse/SPARK-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4057. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging -- Key: SPARK-4057 URL: https://issues.apache.org/jira/browse/SPARK-4057 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Trivial Fix For: 1.3.0 In sbt-launch-lib.bash, the -Xdebug option is used for debugging. We should use the -agentlib option for Java 6+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
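For context, the old and new styles of enabling the JDWP debug agent look like this; the transport/port values shown are conventional defaults, not necessarily the exact text of the patch:

```
# Old style (deprecated since Java 5):
-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005

# Newer style for Java 6+:
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
```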
[jira] [Resolved] (SPARK-4505) Reduce the memory usage of CompactBuffer[T] when T is a primitive type
[ https://issues.apache.org/jira/browse/SPARK-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4505. Resolution: Fixed Fix Version/s: 1.3.0 Reduce the memory usage of CompactBuffer[T] when T is a primitive type -- Key: SPARK-4505 URL: https://issues.apache.org/jira/browse/SPARK-4505 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 If CompactBuffer has a ClassTag parameter, CompactBuffer can create primitive arrays for primitive types. This will reduce memory usage significantly for primitive types, at only a minor performance cost.
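The ClassTag idea can be sketched like this; a simplified illustration, not Spark's actual CompactBuffer:

```scala
import scala.reflect.ClassTag

// With a ClassTag in scope, new Array[T] allocates a primitive array
// (e.g. a raw double[]) when T is primitive, instead of an array of boxed
// objects, which is where the memory savings come from.
class CompactBufferSketch[T: ClassTag](initialCapacity: Int = 8) {
  private var elements = new Array[T](initialCapacity) // primitive-backed for primitive T
  private var curSize = 0

  def +=(value: T): this.type = {
    if (curSize == elements.length) {
      // Grow geometrically, copying the existing elements.
      val bigger = new Array[T](elements.length * 2)
      Array.copy(elements, 0, bigger, 0, curSize)
      elements = bigger
    }
    elements(curSize) = value
    curSize += 1
    this
  }

  def apply(i: Int): T = elements(i)
  def length: Int = curSize
}
```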
[jira] [Updated] (SPARK-4505) Reduce the memory usage of CompactBuffer[T] when T is a primitive type
[ https://issues.apache.org/jira/browse/SPARK-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4505: --- Assignee: Shixiong Zhu
[jira] [Updated] (SPARK-4628) Put external projects and examples behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4628: --- Summary: Put external projects and examples behind a build flag (was: Put all external projects behind a build flag) Put external projects and examples behind a build flag -- Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in Maven Central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple: have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by default due to a license issue.
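Following the -Pkinesis-asl pattern, the opt-in profile in the root pom.xml might look roughly like this; the profile body and module list are illustrative assumptions, not the actual change:

```xml
<!-- Hypothetical sketch: modules only built when -Pexternal-projects is passed -->
<profile>
  <id>external-projects</id>
  <modules>
    <module>external/mqtt</module>
    <module>external/twitter</module>
    <module>external/zeromq</module>
    <module>examples</module>
  </modules>
</profile>
```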
[jira] [Created] (SPARK-4656) Typo in Programming Guide markdown
Kai Sasaki created SPARK-4656: - Summary: Typo in Programming Guide markdown Key: SPARK-4656 URL: https://issues.apache.org/jira/browse/SPARK-4656 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Trivial Grammatical error in Programming Guide document
[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown
[ https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228999#comment-14228999 ] Apache Spark commented on SPARK-4656: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3412
[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown
[ https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228998#comment-14228998 ] Kai Sasaki commented on SPARK-4656: --- Created the patch. Please review it. https://github.com/apache/spark/pull/3412
[jira] [Created] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
pengyanhong created SPARK-4657: -- Summary: RuntimeException: Unsupported datatype DecimalType() Key: SPARK-4657 URL: https://issues.apache.org/jira/browse/SPARK-4657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: pengyanhong java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-4657: --- Description: Executing a query against a Hive table that contains a decimal field fails with the error below: {quote} java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) {quote}
[jira] [Updated] (SPARK-4657) RuntimeException: Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-4657: --- Description: Executing a query against a Hive table that contains a decimal field, then saving the result to Tachyon as a Parquet file, fails with the error below: {quote} java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103) at com.jd.jddp.spark.hive.Cache$.cacheTable(Cache.scala:33) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:61) at com.jd.jddp.spark.hive.Cache$$anonfun$main$5.apply(Cache.scala:59) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at com.jd.jddp.spark.hive.Cache$.main(Cache.scala:59) at com.jd.jddp.spark.hive.Cache.main(Cache.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:459) {quote}
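For context on the error above: it is raised when the Parquet schema converter reaches a type it has no mapping for. Below is a tiny, hypothetical Python model of that failure shape (the map and function names are illustrative, not Spark's actual implementation); a common workaround at the time was to cast decimal columns to double or string in the query before saving as Parquet.

```python
# Hypothetical, simplified model of how a schema converter fails on an
# unmapped type, mirroring the shape of the error in this report.
# Names are illustrative, not Spark's actual implementation.
PRIMITIVE_MAP = {
    "IntegerType": "INT32",
    "LongType": "INT64",
    "StringType": "BINARY",
    "DoubleType": "DOUBLE",
}

def from_data_type(dtype):
    """Map a logical type name to a Parquet primitive, or fail loudly."""
    try:
        return PRIMITIVE_MAP[dtype]
    except KeyError:
        # This is the kind of path DecimalType() hits in the trace above.
        raise RuntimeError("Unsupported datatype %s()" % dtype)
```

In this model, any type outside the map (here, DecimalType) aborts the whole write, which is why casting the offending column up front sidesteps the error.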
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229001#comment-14229001 ] Patrick Wendell commented on SPARK-4630: Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce this complexity into Spark. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, the user pays a disproportionate cost in scheduling overhead. If partitions are too large, task execution slows down due to GC pressure and spilling to disk. To increase job performance, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are:
- incoming partition sizes from the previous stage
- the number of available executors
- available memory per executor (taking into account spark.shuffle.memoryFraction)
Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned off/on with a configuration option.
To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work.
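The heuristic sketched in this proposal can be made concrete. A minimal, hypothetical sketch of how a scheduler might pick a task count from the previous stage's output size; all names, the 128 MB target, and the defaults are illustrative assumptions, not Spark's actual logic:

```python
def choose_num_partitions(total_bytes, num_executors, mem_per_executor_bytes,
                          shuffle_memory_fraction=0.2,
                          target_partition_bytes=128 * 1024 * 1024):
    """Pick a task count so each partition lands near a target size,
    with at least one task per executor. Purely illustrative."""
    # Memory one task can use for shuffle data (hypothetical model of
    # spark.shuffle.memoryFraction).
    usable = mem_per_executor_bytes * shuffle_memory_fraction
    # Aim for the target size, but never exceed what fits in memory.
    per_partition = int(min(target_partition_bytes, usable))
    # Ceiling division: enough partitions to cover all the data.
    by_size = max(1, -(-total_bytes // per_partition))
    # At least one task per executor so the whole cluster stays busy.
    return max(by_size, num_executors)
```

For example, 1280 MB of shuffle output on 4 executors with 4 GB each would yield 10 tasks of ~128 MB, while a tiny input would still get one task per executor.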
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229001#comment-14229001 ] Patrick Wendell edited comment on SPARK-4630 at 11/30/14 3:29 AM: -- Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce too much complexity into Spark for this, but we did add some heuristics over time. It seems like the proposal here is to extend the heuristics a bit, so some simpler extensions might make sense. was (Author: pwendell): Hey Kos - before starting to work on the design for this feature, could you try to quantify how important this actually is for performance? I.e. give some examples at scale in some benchmarks or user workloads? Spark in general is much less sensitive to the number of partitions than other frameworks since the overhead of launching individual tasks is very small. For this reason in the past we specifically decided not to introduce this complexity into Spark. Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis
[jira] [Created] (SPARK-4658) Code documentation issue in DDL of datasource
Ravindra Pesala created SPARK-4658: -- Summary: Code documentation issue in DDL of datasource Key: SPARK-4658 URL: https://issues.apache.org/jira/browse/SPARK-4658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Ravindra Pesala Priority: Minor The CREATE TABLE syntax for data sources is documented incorrectly in ddl.scala: {code} /** * CREATE FOREIGN TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path ../hive/src/test/resources/data/files/episodes.avro) */ {code} The correct syntax is: {code} /** * CREATE TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path ../hive/src/test/resources/data/files/episodes.avro) */ {code} Wrong syntax is likewise documented in newParquet.scala: {code} `CREATE TABLE ... USING org.apache.spark.sql.parquet`. {code} The correct syntax is: {code} `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet`. {code}
[jira] [Commented] (SPARK-4658) Code documentation issue in DDL of datasource
[ https://issues.apache.org/jira/browse/SPARK-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229005#comment-14229005 ] Apache Spark commented on SPARK-4658: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/3516 Code documentation issue in DDL of datasource - Key: SPARK-4658 URL: https://issues.apache.org/jira/browse/SPARK-4658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Ravindra Pesala Priority: Minor
[jira] [Resolved] (SPARK-4507) PR merge script should support closing multiple JIRA tickets
[ https://issues.apache.org/jira/browse/SPARK-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4507. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Takayuki Hasegawa PR merge script should support closing multiple JIRA tickets Key: SPARK-4507 URL: https://issues.apache.org/jira/browse/SPARK-4507 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Josh Rosen Assignee: Takayuki Hasegawa Priority: Minor Labels: starter Fix For: 1.3.0 For pull requests that reference multiple JIRAs in their titles, it would be helpful if the PR merge script offered to close all of them.
[jira] [Resolved] (SPARK-4543) Javadoc failure for network-common causes publish-local to fail
[ https://issues.apache.org/jira/browse/SPARK-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4543. Resolution: Duplicate This turned out to be an instance of SPARK-4193. Javadoc failure for network-common causes publish-local to fail --- Key: SPARK-4543 URL: https://issues.apache.org/jira/browse/SPARK-4543 Project: Spark Issue Type: Bug Components: Build, Documentation, Spark Core Reporter: Pedro Rodriguez Priority: Blocker Javadoc for network-common fails. This causes sbt publish-local to fail and not deliver the spark-network-common jar/module to the local publish location. This in turn causes applications assembled against locally linked Spark to fail to build. Steps:
1. Check out the master branch
2. sbt publish-local
3. Note the javadoc errors
4. Navigate to ~/.ivy2/local/org.apache.spark or equivalent
5. spark-network-common_2.10 should be missing
6. Building an application with sbt assembly against locally linked Spark then fails, since network-common is missing.
I confirmed the problem is the javadoc compilation failure by fixing all javadoc errors with placeholder TODOs. Confirmed the fix by running the above steps, which then work correctly. Pull Request: https://github.com/apache/spark/pull/3405
[jira] [Issue Comment Deleted] (SPARK-4101) [MLLIB] Improve API in Word2Vec model
[ https://issues.apache.org/jira/browse/SPARK-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-4101: Comment: was deleted (was: If no-one is working on this I would be happy to knock this out. Thanks! ) [MLLIB] Improve API in Word2Vec model - Key: SPARK-4101 URL: https://issues.apache.org/jira/browse/SPARK-4101 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Peter Rudenko Priority: Minor 1) It would be nice to be able to retrieve the underlying model map, to be able to work with it afterwards (make an RDD, persist/load, online train, etc.). 2) Add an analogyWords(w1: String, w2: String, target: String, num: Int) method, which returns num words that relate to target as w1 relates to w2.
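The proposed analogyWords method is typically implemented with vector arithmetic: the answer vector is vec(w2) - vec(w1) + vec(target), and the result is its nearest neighbours by cosine similarity. A minimal, self-contained sketch over a plain dict of word vectors (purely illustrative; not the MLlib API):

```python
import math

def analogy_words(model, w1, w2, target, num):
    """Return `num` words relating to `target` as `w1` relates to `w2`:
    nearest neighbours of vec(w2) - vec(w1) + vec(target) by cosine
    similarity. `model` is a plain dict of word -> list[float]."""
    query = [b - a + t for a, b, t in zip(model[w1], model[w2], model[target])]
    qnorm = math.sqrt(sum(x * x for x in query))

    def cosine(v):
        vnorm = math.sqrt(sum(x * x for x in v))
        # Guard against zero vectors to avoid division by zero.
        return sum(p * q for p, q in zip(v, query)) / (vnorm * qnorm or 1.0)

    # Exclude the query words themselves, rank the rest by similarity.
    candidates = [(w, cosine(v)) for w, v in model.items()
                  if w not in (w1, w2, target)]
    candidates.sort(key=lambda wc: -wc[1])
    return [w for w, _ in candidates[:num]]
```

With toy vectors where king = man + royalty, analogy_words(model, "man", "king", "woman", 1) recovers the word whose vector is woman + royalty, i.e. the classic king - man + woman analogy.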
[jira] [Created] (SPARK-4659) Implement K-core decomposition algorithm
Xiaoming Li created SPARK-4659: -- Summary: Implement K-core decomposition algorithm Key: SPARK-4659 URL: https://issues.apache.org/jira/browse/SPARK-4659 Project: Spark Issue Type: New Feature Components: Examples, GraphX Affects Versions: 1.0.2 Reporter: Xiaoming Li I found that GraphX has no algorithm for *K-core/K-shell decomposition* yet. Based on the paper [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6189336], I have implemented the k-core decomposition algorithm, which can be used for analyzing large-scale complex networks. Compared with the traditional K-core decomposition algorithm, it propagates K-shell values via messages instead of restructuring the topology iteratively, which reduces the number of iterations enormously.
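For comparison with the message-passing variant described above, the traditional k-core algorithm repeatedly peels off minimum-degree vertices; a vertex's core number is the degree threshold at which it is removed. A minimal sketch of that reference algorithm (illustrative only, not the proposed GraphX implementation):

```python
from collections import deque

def core_numbers(adj):
    """Classic sequential k-core peeling. `adj` maps each node to a set of
    neighbours; returns a dict node -> core number."""
    degree = {v: len(ns) for v, ns in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Collect every vertex whose degree has dropped to <= k.
        peel = deque(v for v in remaining if degree[v] <= k)
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.popleft()
            if v not in remaining:
                continue  # already peeled via a duplicate queue entry
            core[v] = k
            remaining.discard(v)
            # Removing v lowers its neighbours' degrees; they may now peel too.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core
```

For a triangle with one pendant vertex attached, the triangle's vertices get core number 2 and the pendant gets 1; the message-passing formulation in the ticket computes the same values without physically removing vertices.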