[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-07 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021115#comment-14021115
 ] 

Matei Zaharia commented on SPARK-2044:
--

Alright so I've posted my code at https://github.com/apache/spark/pull/1009. 
There are still two things missing:
* Moving MapOutputTracker behind this interface
* Moving aggregation into the ShuffleReaders and ShuffleWriters instead of 
having it inside RDD operations

Maybe we can open those as separate JIRAs and more people can work on them.
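
For context, a minimal sketch of the shape such an interface could take (trait and method names here are illustrative assumptions, not the actual API in the PR): the contract between shuffles and the rest of the engine stays narrow - a writer for map output and an iterator-based reader for reduce input.
{code}
// Illustrative only - names are assumptions, not the PR's API.
trait ShuffleWriter[K, V] {
  def write(records: Iterator[Product2[K, V]]): Unit  // write one map task's output
  def stop(success: Boolean): Unit                    // commit or abort
}

trait ShuffleReader[K, C] {
  def read(): Iterator[Product2[K, C]]                // read this reducer's input
}

trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](shuffleId: Int, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
}
{code}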

> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I'm aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-938) OpenStack Swift Storage Support

2014-06-07 Thread Gil Vernik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021114#comment-14021114
 ] 

Gil Vernik commented on SPARK-938:
--

I am working on it. About to submit the patch. Almost done.

> OpenStack Swift Storage Support
> ---
>
> Key: SPARK-938
> URL: https://issues.apache.org/jira/browse/SPARK-938
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Examples, Input/Output, Spark Core
>Affects Versions: 0.8.1
>Reporter: Murali Raju
>Priority: Minor
>
> This issue is to track OpenStack Swift Storage support (development in 
> progress) in addition to S3 for Spark.
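
For illustration only - a sketch of what Swift access could look like next to S3, assuming a Hadoop filesystem binding that registers a swift:// scheme (the URL format is an assumption, not necessarily the patch's final form):
{code}
// Assumed URL schemes for illustration; the actual scheme and config
// come with the Hadoop Swift filesystem binding.
val s3Data    = sc.textFile("s3n://bucket/path/to/data")
val swiftData = sc.textFile("swift://container.provider/path/to/data")
{code}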



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021020#comment-14021020
 ] 

Mridul Muralidharan commented on SPARK-2064:


Ah, I assumed there was a disconnect.
So in YARN, log aggregation means we don't care about the actual nodes where a 
container was run (and we usually don't have access to the compute nodes anyway).

What about listing only nodes and not executors? Would that help?

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2067) Spark logo in application UI uses absolute path

2014-06-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2067:
---

Target Version/s: 1.0.1, 1.1.0

> Spark logo in application UI uses absolute path
> ---
>
> Key: SPARK-2067
> URL: https://issues.apache.org/jira/browse/SPARK-2067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Trivial
>
> The link on the Spark logo in the application UI (top left corner) is 
> hard-coded to "/", and points to the wrong page when running behind the YARN 
> proxy. It should use uiRoot instead.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021018#comment-14021018
 ] 

Reynold Xin commented on SPARK-2064:


One thing is that this helps identify executors that are dead, which is often 
important for debugging (finding out why they died - maybe a full disk left the 
system unresponsive, etc.). It is also very useful information to have for spot 
instances on EC2, where executors might just die.

If memory is the problem, we can cap the number of dead executors the UI 
tracks; alternatively, we can put the list of dead executors in external 
storage (a SQLite database, or even just a text file in the log directory).
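
A minimal sketch of the capping idea (hypothetical names and cap value, not Spark code): keep only the N most recent dead executors so UI memory stays bounded regardless of churn.
{code}
import scala.collection.mutable

// Hypothetical sketch: bound the number of dead executors the UI retains.
val maxDeadExecutors = 100
val deadExecutors = mutable.Queue.empty[String]  // executor IDs, oldest first

def recordDead(execId: String): Unit = {
  deadExecutors.enqueue(execId)
  if (deadExecutors.size > maxDeadExecutors) {
    deadExecutors.dequeue()  // evict the oldest entry
  }
}
{code}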

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2056) Set RDD name to input path

2014-06-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2056:
---

Assignee: Neville Li

> Set RDD name to input path
> --
>
> Key: SPARK-2056
> URL: https://issues.apache.org/jira/browse/SPARK-2056
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Neville Li
>Assignee: Neville Li
>Priority: Trivial
> Fix For: 1.1.0
>
>
> RDDs have no names by default. Setting the name to the input path when opening 
> from a file system makes it easier to understand job performance.
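
For illustration, a sketch of the idea (hypothetical path):
{code}
// Hypothetical example: naming the RDD after its input path makes it
// identifiable in the web UI instead of showing up unnamed.
val path = "hdfs:///data/events.log"
val events = sc.textFile(path).setName(path)
{code}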



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2056) Set RDD name to input path

2014-06-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2056.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 992
[https://github.com/apache/spark/pull/992]

> Set RDD name to input path
> --
>
> Key: SPARK-2056
> URL: https://issues.apache.org/jira/browse/SPARK-2056
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Neville Li
>Priority: Trivial
> Fix For: 1.1.0
>
>
> RDDs have no names by default. Setting the name to the input path when opening 
> from a file system makes it easier to understand job performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021011#comment-14021011
 ] 

Mridul Muralidharan commented on SPARK-2064:


I am probably missing the intent behind this change.
What is the expected use case it is supposed to help with?

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021008#comment-14021008
 ] 

Mridul Muralidharan commented on SPARK-2064:


Unfortunately OOM is a very big issue for us, since the application master is a 
single point of failure when running in YARN - particularly when memory is 
constrained and strictly enforced by the YARN containers (requiring higher 
overheads to be specified, reducing usable memory even further).

Given this, and given the fair amount of churn already in executor containers, I 
am hesitant about features which add even further to the UI's memory footprint. 
The cumulative impact of the UI is nontrivial, as I mentioned before. This, for 
example, would require 1-8% of master memory when there is reasonable churn in 
long-running jobs (30 hours) on a reasonable number of executors (200-300).


> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976
 ] 

Patrick Wendell edited comment on SPARK-2064 at 6/7/14 9:16 PM:


I don't think OOM is an issue here - but I think this used to be the behavior, 
and users requested that we clean up the old executors because otherwise, for a 
long-running service, you get a really large list. Maybe we should have a TTL 
and remove dead executors after that time.


was (Author: pwendell):
I don't think OOM is an issue here - but I think this used to be the behavior, 
and users requested that we clean up the old executors because otherwise, for a 
long-running service, you get a really large list. Maybe we should have a 
timeout.

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976
 ] 

Patrick Wendell commented on SPARK-2064:


I don't think OOM is an issue here - but I think this used to be the behavior, 
and users requested that we clean up the old executors because otherwise, for a 
long-running service, you get a really large list. Maybe we should have a 
timeout.
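
A minimal sketch of the timeout idea (hypothetical names and retention value, not Spark code): periodically drop executors that have been dead longer than a configurable TTL.
{code}
import scala.collection.mutable

// Hypothetical sketch: retain dead executors for a bounded time only.
val deadExecutorTtlMs = 24L * 60 * 60 * 1000     // e.g. keep entries for a day
val deadSince = mutable.Map.empty[String, Long]  // executor ID -> death time

def pruneDead(now: Long): Unit =
  deadSince.retain { case (_, diedAt) => now - diedAt < deadExecutorTtlMs }
{code}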

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020971#comment-14020971
 ] 

Reynold Xin commented on SPARK-2064:


My estimate gave a very generous upper bound. How often does your cluster churn 
through every single node 100 times?

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2059) Unresolved Attributes should cause a failure before execution time

2014-06-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2059:


Assignee: Cheng Lian

> Unresolved Attributes should cause a failure before execution time
> --
>
> Key: SPARK-2059
> URL: https://issues.apache.org/jira/browse/SPARK-2059
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>
> Here's a partial solution: 
> https://github.com/marmbrus/spark/tree/analysisChecks
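
A self-contained sketch of the kind of check involved (toy types standing in for Catalyst's LogicalPlan and Expression; the linked branch is the real partial solution): walk the analyzed plan and fail fast if anything is still unresolved.
{code}
// Toy stand-ins for Catalyst types, for illustration only.
trait Expr { def resolved: Boolean }
trait Plan { def expressions: Seq[Expr]; def children: Seq[Plan] }

// Fail before execution if any expression in the plan is unresolved.
def checkResolved(plan: Plan): Unit = {
  plan.expressions.filterNot(_.resolved).foreach { e =>
    throw new RuntimeException(s"Unresolved expression in plan: $e")
  }
  plan.children.foreach(checkResolved)
}
{code}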



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-06-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2063:


Assignee: Cheng Lian  (was: Michael Armbrust)

> Creating a SchemaRDD via sql() does not correctly resolve nested types
> --
>
> Key: SPARK-2063
> URL: https://issues.apache.org/jira/browse/SPARK-2063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>
> For example, from the typical twitter dataset:
> {code}
> scala> val popularTweets = sql("SELECT retweeted_status.text, 
> MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status 
> is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
> scala> popularTweets.toString
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'retweeted_status.text
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.sc

[jira] [Updated] (SPARK-2053) Add Catalyst expression for CASE WHEN

2014-06-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2053:


Assignee: Zongheng Yang

> Add Catalyst expression for CASE WHEN
> -
>
> Key: SPARK-2053
> URL: https://issues.apache.org/jira/browse/SPARK-2053
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>
> Here's a rough start: 
> https://github.com/marmbrus/spark/commit/1209daaf49b0a87e7f68f89c79d02b446e624db3
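
A self-contained sketch of the evaluation semantics such an expression needs (a toy Expr trait stands in for Catalyst's Expression; the flat branch layout is one possible encoding, not necessarily the linked commit's):
{code}
// Toy stand-in for Catalyst's Expression, for illustration only.
trait Expr { def eval(row: Map[String, Any]): Any }

// branches hold alternating (condition, value) pairs, with an optional
// trailing ELSE value; conditions are checked in order.
case class CaseWhen(branches: Seq[Expr]) extends Expr {
  def eval(row: Map[String, Any]): Any = {
    var i = 0
    while (i + 1 < branches.length) {
      if (branches(i).eval(row) == true) return branches(i + 1).eval(row)
      i += 2
    }
    if (branches.length % 2 == 1) branches.last.eval(row) else null
  }
}
{code}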



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL

2014-06-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2060:


Assignee: Yin Huai

> Querying JSON Datasets with SQL and DSL in Spark SQL
> 
>
> Key: SPARK-2060
> URL: https://issues.apache.org/jira/browse/SPARK-2060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-06-07 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020944#comment-14020944
 ] 

Michael Armbrust commented on SPARK-2063:
-

Here is a reproducible test case:
{code}
case class TableName(tableName: String)
TestSQLContext.sparkContext.parallelize(TableName("test") :: Nil)
  .registerAsTable("tableName")

case class NestedData(a: String)
case class TopLevelRecord(n: NestedData)
val nestedData =
  TestSQLContext.sparkContext.parallelize(
    TopLevelRecord(NestedData("value1")) ::
    TopLevelRecord(NestedData("value2")) :: Nil)
nestedData.registerAsTable("nestedData")

test("nested data") {
  val query1 =
    sql("SELECT n, n.a FROM nestedData GROUP BY a ORDER BY a LIMIT 10")
  // query1.collect()
  val query2 = query1.select('a)
  checkAnswer(
    query2,
    "test")
}
{code}


> Creating a SchemaRDD via sql() does not correctly resolve nested types
> --
>
> Key: SPARK-2063
> URL: https://issues.apache.org/jira/browse/SPARK-2063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Michael Armbrust
>
> For example, from the typical twitter dataset:
> {code}
> scala> val popularTweets = sql("SELECT retweeted_status.text, 
> MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status 
> is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
> scala> popularTweets.toString
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'retweeted_status.text
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala

[jira] [Created] (SPARK-2068) Remove other uses of @transient lazy val in physical plan nodes

2014-06-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2068:
---

 Summary: Remove other uses of @transient lazy val in physical plan 
nodes
 Key: SPARK-2068
 URL: https://issues.apache.org/jira/browse/SPARK-2068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.1.0


[SPARK-1994] was caused by this. We fixed it there, but in general, doing 
planning on the slaves breaks a lot of our assumptions and seems to cause 
concurrency problems.
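
A small illustration of the pitfall being removed (hypothetical node, not Spark code): @transient lazy vals are not serialized, so a plan node shipped to an executor re-evaluates them there on first access - which is exactly the on-slave planning described above.
{code}
// Hypothetical plan node illustrating the @transient lazy val pitfall.
case class ExamplePlanNode(childOutput: Seq[String]) extends Serializable {
  // Not serialized: recomputed lazily on whichever JVM first touches it,
  // i.e. planning work can silently run on the slaves after deserialization.
  @transient lazy val plannedOutput: Seq[String] =
    childOutput.map(_.toLowerCase)  // stand-in for expensive planning logic
}
{code}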



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020936#comment-14020936
 ] 

Mridul Muralidharan commented on SPARK-2064:


It is 100 MB (or more) of memory which could be used elsewhere.
In our clusters, for example, the number of workers can be very high while the 
containers can be quite ephemeral under load (and so there are a lot of 
container losses); on the other hand, memory per container is constrained to 
about 8 GB (lower once we account for overheads, etc.).

So the amount of working memory in the master shrinks: we are finding that the 
UI and related code paths are among the portions occupying the most memory in 
the OOM dumps of the master.

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2067) Spark logo in application UI uses absolute path

2014-06-07 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020932#comment-14020932
 ] 

Neville Li commented on SPARK-2067:
---

A simple fix: https://github.com/apache/spark/pull/1006

> Spark logo in application UI uses absolute path
> ---
>
> Key: SPARK-2067
> URL: https://issues.apache.org/jira/browse/SPARK-2067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Trivial
>
> The link on the Spark logo in the application UI (top left corner) is 
> hard-coded to "/", and points to the wrong page when running behind the YARN 
> proxy. It should use uiRoot instead.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2067) Spark logo in application UI uses absolute path

2014-06-07 Thread Neville Li (JIRA)
Neville Li created SPARK-2067:
-

 Summary: Spark logo in application UI uses absolute path
 Key: SPARK-2067
 URL: https://issues.apache.org/jira/browse/SPARK-2067
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Trivial


The link on the Spark logo in the application UI (top left corner) is 
hard-coded to "/", and points to the wrong page when running behind the YARN 
proxy. It should use uiRoot instead.
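
A minimal sketch of the fix's direction (assuming the proxy base is exposed via a property such as spark.ui.proxyBase; see the linked PR for the actual change):
{code}
// Hypothetical sketch: resolve the logo link against the UI root rather
// than the hard-coded absolute "/".
val uiRoot: String = sys.props.getOrElse("spark.ui.proxyBase", "")
val logoHref: String = uiRoot + "/"
{code}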



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020897#comment-14020897
 ] 

Reynold Xin commented on SPARK-2064:


Is memory really an issue here?

On a 1000-node cluster, let's say we need 1KB to track each executor (which 
should be more than enough); then we need 1MB to track all of them. In less 
than 100MB, we can crash & restart all of them 100 times.

If it really becomes a problem, perhaps we can clean up dead ones after a 
certain time period.

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2065) Have spark-ec2 set EC2 instance names

2014-06-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020893#comment-14020893
 ] 

Nicholas Chammas commented on SPARK-2065:
-

Sure, I'd love to.

> Have spark-ec2 set EC2 instance names
> -
>
> Key: SPARK-2065
> URL: https://issues.apache.org/jira/browse/SPARK-2065
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> {{spark-ec2}} launches EC2 instances with no names. It would be nice if it 
> gave each instance it launched a descriptive name.
> I suggest:
> {code}
> spark-{spark-cluster-name}-{master,slave}-{instance-id}
> {code}
> For example, the instances of a Spark cluster called {{prod1}} would have the 
> following names:
> {code}
> spark-prod1-master-i-18a1f548
> spark-prod1-slave-i-01a1f551
> spark-prod1-slave-i-04a1f554
> spark-prod1-slave-i-05a1f555
> spark-prod1-slave-i-06a1f556
> {code}
> Amazon implements instance names as 
> [tags|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html], so 
> that's what would need to be set for each launched instance.
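
For illustration - spark-ec2 itself is a Python/boto script, but a sketch using the AWS SDK for Java shows the mechanics: an instance's "name" is just a Name tag.
{code}
// Illustrative only; spark-ec2 would do the equivalent through boto.
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.{CreateTagsRequest, Tag}

def nameInstance(ec2: AmazonEC2Client, clusterName: String,
                 role: String, instanceId: String): Unit = {
  val name = s"spark-$clusterName-$role-$instanceId"
  ec2.createTags(new CreateTagsRequest()
    .withResources(instanceId)         // tag this instance
    .withTags(new Tag("Name", name)))  // EC2 displays the Name tag as the name
}
{code}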



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-06-07 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020855#comment-14020855
 ] 

Prashant Sharma commented on SPARK-1812:


We will need Kafka, akka-zeromq, and Twitter Chill to be released for Scala 2.11.

> Support cross-building with Scala 2.11
> --
>
> Key: SPARK-1812
> URL: https://issues.apache.org/jira/browse/SPARK-1812
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Spark Core
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>
> Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
> for both versions. From what I understand, there are basically two things we 
> need to figure out:
> 1. Have two versions of our dependency graph, one that uses 2.11 
> dependencies and the other that uses 2.10 dependencies.
> 2. Figure out how to publish different POMs for 2.10 and 2.11.
> I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
> really well supported by Maven, since published POMs aren't generated 
> dynamically, but we can probably script around it to make it work. I've done 
> some initial sanity checks with a simple build here:
> https://github.com/pwendell/scala-maven-crossbuild
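
A minimal sketch of what the profile in (1) could look like (property names are assumptions; see the linked repo for the actual experiment):
{code}
<!-- Hypothetical Maven profile selecting the Scala 2.11 dependency graph -->
<profile>
  <id>scala-2.11</id>
  <properties>
    <scala.version>2.11.0</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>
</profile>
{code}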



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020789#comment-14020789
 ] 

Mridul Muralidharan commented on SPARK-2064:


Depending on how long a job runs, this can cause an OOM on the master.
In YARN (and Mesos?), an executor relaunched after a failure on the same node 
gets a different port - and so ends up as a different executor in the list.

> web ui should not remove executors if they are dead
> ---
>
> Key: SPARK-2064
> URL: https://issues.apache.org/jira/browse/SPARK-2064
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>
> We should always show the list of executors that have ever been connected, 
> and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2066) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61

2014-06-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2066:
--

 Summary: 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to 
evaluate expression. type: AttributeReference, tree: key#61
 Key: SPARK-2066
 URL: https://issues.apache.org/jira/browse/SPARK-2066
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Cheng Lian
 Fix For: 1.0.1, 1.1.0


[~marmbrus]

Run the following query
{code}
scala> c.hql("select key, count(*) from src").collect()
{code}

Got the following exception at runtime
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to 
evaluate expression. type: AttributeReference, tree: key#61
at 
org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}

This should either fail at analysis time or pass at runtime. It definitely 
shouldn't fail at runtime.
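
For reference, the query selects a non-aggregated column (key) alongside count(*) without a GROUP BY, which is what analysis ought to reject; the well-formed variant would be:
{code}
scala> c.hql("select key, count(*) from src group by key").collect()
{code}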



--
This message was sent by Atlassian JIRA
(v6.2#6252)