[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021115#comment-14021115 ]

Matei Zaharia commented on SPARK-2044:
--------------------------------------

Alright, so I've posted my code at https://github.com/apache/spark/pull/1009. There are still two things missing:

* Moving MapOutputTracker behind this interface
* Moving aggregation into the ShuffleReaders and ShuffleWriters instead of having it inside RDD operations

Maybe we can open those as separate JIRAs so that more people can work on them.

> Pluggable interface for shuffles
> --------------------------------
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
> Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I'm aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are:
> * Push-based shuffle, where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism, or even the algorithm, for downstream stages at runtime based on statistics of the map output (this is something we had prototyped in the Shark research project but never merged into core)
> I've attached a design doc with a proposed interface. It's not too crazy, because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out.

-- This message was sent by Atlassian JIRA (v6.2#6252)
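The "narrow interface" the description mentions (iterators for reading data, a writer interface for writing it) can be sketched roughly as below. This is an illustrative sketch only: the trait and method names here are assumptions, not the API actually merged in the pull request, and the in-memory backend stands in for the hash-, sort-, or push-based implementations being discussed.

```scala
import scala.collection.mutable

// Hypothetical sketch of a pluggable shuffle interface; names are
// illustrative, not the API merged in apache/spark#1009.
trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit  // write one map task's output
  def stop(): Unit
}

trait ShuffleReader[K, V] {
  def read(): Iterator[(K, V)]  // read one reduce partition's input
}

trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, V](shuffleId: Int, reduceId: Int): ShuffleReader[K, V]
}

// A toy in-memory backend showing how an implementation plugs in behind
// the interface without the rest of the code changing.
class InMemoryShuffleManager extends ShuffleManager {
  private val store = mutable.Map[Int, mutable.ArrayBuffer[(Any, Any)]]()

  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V] =
    new ShuffleWriter[K, V] {
      def write(records: Iterator[(K, V)]): Unit =
        store.getOrElseUpdate(shuffleId, mutable.ArrayBuffer()) ++= records
      def stop(): Unit = ()
    }

  def getReader[K, V](shuffleId: Int, reduceId: Int): ShuffleReader[K, V] =
    new ShuffleReader[K, V] {
      def read(): Iterator[(K, V)] =
        store.getOrElse(shuffleId, mutable.ArrayBuffer()).iterator
          .map { case (k, v) => (k.asInstanceOf[K], v.asInstanceOf[V]) }
    }
}
```

Because callers only see ShuffleManager, swapping the toy backend for a sort-based one would be a configuration change rather than a change to RDD operations.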
[jira] [Commented] (SPARK-938) OpenStack Swift Storage Support
[ https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021114#comment-14021114 ]

Gil Vernik commented on SPARK-938:
----------------------------------

I am working on it and am about to submit the patch; it is almost done.

> OpenStack Swift Storage Support
> -------------------------------
>
> Key: SPARK-938
> URL: https://issues.apache.org/jira/browse/SPARK-938
> Project: Spark
> Issue Type: New Feature
> Components: Documentation, Examples, Input/Output, Spark Core
> Affects Versions: 0.8.1
> Reporter: Murali Raju
> Priority: Minor
>
> This issue is to track OpenStack Swift Storage support (development in progress) in addition to S3 for Spark.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021020#comment-14021020 ]

Mridul Muralidharan commented on SPARK-2064:
--------------------------------------------

Ah, I assumed there was a disconnect. So in YARN, log aggregation means we don't care about the actual nodes where a container was run (and we usually don't have access to compute nodes anyway). What about listing only nodes and not executors? Would that help?

> web ui should not remove executors if they are dead
> ---------------------------------------------------
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
> Issue Type: Sub-task
> Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, and add a status column to mark them as dead if they have been disconnected.
[jira] [Updated] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-2067:
-----------------------------------
Target Version/s: 1.0.1, 1.1.0

> Spark logo in application UI uses absolute path
> -----------------------------------------------
>
> Key: SPARK-2067
> URL: https://issues.apache.org/jira/browse/SPARK-2067
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.0.0
> Reporter: Neville Li
> Priority: Trivial
>
> The link of the Spark logo in the application UI (top left corner) is hard coded to "/" and points to the wrong page when running behind the YARN proxy. It should use uiRoot instead.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021018#comment-14021018 ]

Reynold Xin commented on SPARK-2064:
------------------------------------

One thing is that we can help identify executors that are dead, which is often important for debugging (finding out why they died: maybe the disk filled up and made the system unresponsive, etc.). It is also very useful information to have for spot instances on EC2, where executors might just die. If memory is the problem, we can cap the number of dead executors the UI tracks; alternatively, we can put the list of dead executors onto external storage (a sqlite database, or even just a text file in the log directory).
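The "cap the number of dead executors the UI tracks" idea above could be sketched as a small bounded registry. This is illustrative only, not Spark's actual UI code; the class and method names are assumptions:

```scala
import scala.collection.mutable

// Illustrative sketch of capping how many dead executors a UI retains;
// not Spark's actual UI code. Oldest entries are evicted past the cap.
class DeadExecutorLog(maxEntries: Int) {
  // Insertion-ordered map: executor id -> reason it was marked dead.
  private val entries = mutable.LinkedHashMap[String, String]()

  def markDead(execId: String, reason: String): Unit = {
    entries(execId) = reason
    // Evict the oldest entries once the cap is exceeded, bounding memory.
    while (entries.size > maxEntries) entries.remove(entries.head._1)
  }

  def snapshot: Seq[(String, String)] = entries.toSeq
}
```

With a fixed cap, memory use is bounded regardless of how much executor churn a long-running application sees.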
[jira] [Updated] (SPARK-2056) Set RDD name to input path
[ https://issues.apache.org/jira/browse/SPARK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-2056:
-----------------------------------
Assignee: Neville Li

> Set RDD name to input path
> --------------------------
>
> Key: SPARK-2056
> URL: https://issues.apache.org/jira/browse/SPARK-2056
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Reporter: Neville Li
> Assignee: Neville Li
> Priority: Trivial
> Fix For: 1.1.0
>
> RDDs have no names by default. Setting them to the input path after opening from the file system makes it easier to understand job performance.
[jira] [Resolved] (SPARK-2056) Set RDD name to input path
[ https://issues.apache.org/jira/browse/SPARK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-2056.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 992: https://github.com/apache/spark/pull/992
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021011#comment-14021011 ]

Mridul Muralidharan commented on SPARK-2064:
--------------------------------------------

I am probably missing the intent behind this change. What is the expected use case it is supposed to help with?
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021008#comment-14021008 ]

Mridul Muralidharan commented on SPARK-2064:
--------------------------------------------

Unfortunately OOM is a very big issue for us, since the application master is a single point of failure when running in YARN, particularly when memory is constrained and vigorously enforced by the YARN containers (requiring higher overheads to be specified, which reduces usable memory even further). Given this, and given the fair amount of churn already for executor containers, I am hesitant about features which add to the memory footprint of the UI even further. The cumulative impact of the UI is nontrivial, as I mentioned before. This, for example, would require 1-8% of master memory when there is reasonable churn for long-running jobs (30 hours) on a reasonable number of executors (200-300).
[jira] [Comment Edited] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976 ]

Patrick Wendell edited comment on SPARK-2064 at 6/7/14 9:16 PM:
----------------------------------------------------------------

I don't think OOM is an issue here, but I think this used to be the behavior, and users requested that we clean up the old executors because otherwise a long-running service accumulates a really large list. Maybe we should have a TTL and remove dead executors after that time.

was (Author: pwendell):
I don't think OOM is an issue here, but I think this used to be the behavior, and users requested that we clean up the old executors because otherwise a long-running service accumulates a really large list. Maybe we should have a timeout.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976 ]

Patrick Wendell commented on SPARK-2064:
----------------------------------------

I don't think OOM is an issue here, but I think this used to be the behavior, and users requested that we clean up the old executors because otherwise a long-running service accumulates a really large list. Maybe we should have a timeout.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020971#comment-14020971 ]

Reynold Xin commented on SPARK-2064:
------------------------------------

My estimate gave you a very high upper bound. How often does your cluster churn through every single node 100 times?
[jira] [Updated] (SPARK-2059) Unresolved Attributes should cause a failure before execution time
[ https://issues.apache.org/jira/browse/SPARK-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2059:
------------------------------------
Assignee: Cheng Lian

> Unresolved Attributes should cause a failure before execution time
> ------------------------------------------------------------------
>
> Key: SPARK-2059
> URL: https://issues.apache.org/jira/browse/SPARK-2059
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.0
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
>
> Here's a partial solution: https://github.com/marmbrus/spark/tree/analysisChecks
[jira] [Updated] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types
[ https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2063:
------------------------------------
Assignee: Cheng Lian (was: Michael Armbrust)

> Creating a SchemaRDD via sql() does not correctly resolve nested types
> ----------------------------------------------------------------------
>
> Key: SPARK-2063
> URL: https://issues.apache.org/jira/browse/SPARK-2063
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.0
> Reporter: Aaron Davidson
> Assignee: Cheng Lian
>
> For example, from the typical twitter dataset:
> {code}
> scala> val popularTweets = sql("SELECT retweeted_status.text, MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
> scala> popularTweets.toString
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'retweeted_status.text
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.sc
[jira] [Updated] (SPARK-2053) Add Catalyst expression for CASE WHEN
[ https://issues.apache.org/jira/browse/SPARK-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2053:
------------------------------------
Assignee: Zongheng Yang

> Add Catalyst expression for CASE WHEN
> -------------------------------------
>
> Key: SPARK-2053
> URL: https://issues.apache.org/jira/browse/SPARK-2053
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Zongheng Yang
>
> Here's a rough start: https://github.com/marmbrus/spark/commit/1209daaf49b0a87e7f68f89c79d02b446e624db3
[jira] [Updated] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2060:
------------------------------------
Assignee: Yin Huai

> Querying JSON Datasets with SQL and DSL in Spark SQL
> ----------------------------------------------------
>
> Key: SPARK-2060
> URL: https://issues.apache.org/jira/browse/SPARK-2060
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
[jira] [Commented] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types
[ https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020944#comment-14020944 ]

Michael Armbrust commented on SPARK-2063:
-----------------------------------------

Here is a reproducible test case:

{code}
case class TableName(tableName: String)
TestSQLContext.sparkContext.parallelize(TableName("test") :: Nil).registerAsTable("tableName")

case class NestedData(a: String)
case class TopLevelRecord(n: NestedData)
val nestedData = TestSQLContext.sparkContext.parallelize(
  TopLevelRecord(NestedData("value1")) ::
  TopLevelRecord(NestedData("value2")) :: Nil)
nestedData.registerAsTable("nestedData")

test("nested data") {
  val query1 = sql("SELECT n, n.a FROM nestedData GROUP BY a ORDER BY a LIMIT 10")
  //query1.collect()
  val query2 = query1.select('a)
  checkAnswer(query2, "test")
}
{code}
[jira] [Created] (SPARK-2068) Remove other uses of @transient lazy val in physical plan nodes
Michael Armbrust created SPARK-2068:
------------------------------------

Summary: Remove other uses of @transient lazy val in physical plan nodes
Key: SPARK-2068
URL: https://issues.apache.org/jira/browse/SPARK-2068
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Fix For: 1.1.0

[SPARK-1994] was caused by this. We fixed it there, but in general doing planning on the slaves breaks a lot of our assumptions and seems to cause concurrency problems.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020936#comment-14020936 ]

Mridul Muralidharan commented on SPARK-2064:
--------------------------------------------

It is 100 MB (or more) of memory which could be used elsewhere. In our clusters, for example, the number of workers can be very high while the containers can be quite ephemeral under load (and so there are a lot of container losses); on the other hand, memory per container is constrained to about 8 GB (lower once we account for overheads, etc.). So the amount of working memory in the master shrinks: we are finding that the UI and related code paths are among the portions occupying the most memory in the OOM dumps of the master.
[jira] [Commented] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020932#comment-14020932 ]

Neville Li commented on SPARK-2067:
-----------------------------------

A simple fix: https://github.com/apache/spark/pull/1006
[jira] [Created] (SPARK-2067) Spark logo in application UI uses absolute path
Neville Li created SPARK-2067:
------------------------------

Summary: Spark logo in application UI uses absolute path
Key: SPARK-2067
URL: https://issues.apache.org/jira/browse/SPARK-2067
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Trivial

The link of the Spark logo in the application UI (top left corner) is hard coded to "/" and points to the wrong page when running behind the YARN proxy. It should use uiRoot instead.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020897#comment-14020897 ]

Reynold Xin commented on SPARK-2064:
------------------------------------

Is memory really an issue here? On a 1000-node cluster, say we need 1 KB to track each executor (which should be more than enough); then we need 1 MB to track all of them. In less than 100 MB, we can crash and restart all of them 100 times. If it really becomes a problem, perhaps we can clean up dead ones after a certain time period.
[jira] [Commented] (SPARK-2065) Have spark-ec2 set EC2 instance names
[ https://issues.apache.org/jira/browse/SPARK-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020893#comment-14020893 ]

Nicholas Chammas commented on SPARK-2065:
-----------------------------------------

Sure, I'd love to.

> Have spark-ec2 set EC2 instance names
> -------------------------------------
>
> Key: SPARK-2065
> URL: https://issues.apache.org/jira/browse/SPARK-2065
> Project: Spark
> Issue Type: Improvement
> Components: EC2
> Affects Versions: 1.0.0
> Reporter: Nicholas Chammas
> Priority: Trivial
>
> {{spark-ec2}} launches EC2 instances with no names. It would be nice if it gave each instance it launched a descriptive name.
> I suggest:
> {code}
> spark-{spark-cluster-name}-{master,slave}-{instance-id}
> {code}
> For example, the instances of a Spark cluster called {{prod1}} would have the following names:
> {code}
> spark-prod1-master-i-18a1f548
> spark-prod1-slave-i-01a1f551
> spark-prod1-slave-i-04a1f554
> spark-prod1-slave-i-05a1f555
> spark-prod1-slave-i-06a1f556
> {code}
> Amazon implements instance names as [tags|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html], so that's what would need to be set for each launched instance.
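The proposed naming scheme above boils down to formatting a `Name` tag value per instance. As a minimal sketch (this helper is hypothetical and not part of spark-ec2, which would set the tag through the EC2 API):

```scala
// Hypothetical helper producing the proposed Name tag value,
// following the scheme spark-{cluster}-{master,slave}-{instance-id}.
def instanceName(cluster: String, role: String, instanceId: String): String =
  s"spark-$cluster-$role-$instanceId"
```

For example, `instanceName("prod1", "master", "i-18a1f548")` yields `spark-prod1-master-i-18a1f548`, matching the list in the issue description.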
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020855#comment-14020855 ]

Prashant Sharma commented on SPARK-1812:
----------------------------------------

We will need to release kafka, akka-zeromq and twitter chill for Scala 2.11.

> Support cross-building with Scala 2.11
> --------------------------------------
>
> Key: SPARK-1812
> URL: https://issues.apache.org/jira/browse/SPARK-1812
> Project: Spark
> Issue Type: New Feature
> Components: Build, Spark Core
> Reporter: Matei Zaharia
> Assignee: Prashant Sharma
>
> Since Scala 2.10/2.11 are source compatible, we should be able to cross-build for both versions. From what I understand, there are basically two things we need to figure out:
> 1. Have two versions of our dependency graph, one that uses 2.11 dependencies and one that uses 2.10 dependencies.
> 2. Figure out how to publish different poms for 2.10 and 2.11.
> I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't really well supported by Maven, since published poms aren't generated dynamically, but we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020789#comment-14020789 ]

Mridul Muralidharan commented on SPARK-2064:
--------------------------------------------

Depending on how long a job runs, this can cause an OOM on the master. In YARN (and Mesos?) an executor on the same node gets a different port if it is relaunched after a failure, and so it ends up as a different executor in the list.
[jira] [Created] (SPARK-2066) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61
Reynold Xin created SPARK-2066:
-------------------------------

Summary: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61
Key: SPARK-2066
URL: https://issues.apache.org/jira/browse/SPARK-2066
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Cheng Lian
Fix For: 1.0.1, 1.1.0

[~marmbrus] Run the following query:
{code}
scala> c.hql("select key, count(*) from src").collect()
{code}
Got the following exception at runtime:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157)
  at org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35)
  at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154)
  at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
{code}
The query selects the non-aggregated column {{key}} without a GROUP BY, so this should either fail at analysis time or pass at runtime; it definitely shouldn't fail at runtime.