[jira] [Updated] (SPARK-2227) Support "dfs" command

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2227:
---

Assignee: Reynold Xin

> Support "dfs" command
> -
>
> Key: SPARK-2227
> URL: https://issues.apache.org/jira/browse/SPARK-2227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Minor
>
> Potentially just delegate to Hive. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2227) Support "dfs" command

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2227:
---

Affects Version/s: (was: 1.0.0)

> Support "dfs" command
> -
>
> Key: SPARK-2227
> URL: https://issues.apache.org/jira/browse/SPARK-2227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Minor
>
> Potentially just delegate to Hive. 





[jira] [Updated] (SPARK-2227) Support "dfs" command

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2227:
---

Priority: Minor  (was: Major)

> Support "dfs" command
> -
>
> Key: SPARK-2227
> URL: https://issues.apache.org/jira/browse/SPARK-2227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Priority: Minor
>
> Potentially just delegate to Hive. 





[jira] [Commented] (SPARK-2179) Public API for DataTypes and Schema

2014-06-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039682#comment-14039682
 ] 

Yin Huai commented on SPARK-2179:
-

Just a note. For every data type we have, I think it is better to have a string 
name (e.g. "int" for IntegerType).

> Public API for DataTypes and Schema
> ---
>
> Key: SPARK-2179
> URL: https://issues.apache.org/jira/browse/SPARK-2179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>
> We want something like the following:
>  * Expose DataType in the SQL package and lock down all the internal details 
> (TypeTags, etc)
>  * Programmatic API for viewing the schema of an RDD as a StructType
>  * Method that creates a schema RDD given (RDD[A], StructType, A => Row)
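As a rough illustration of the bullet points above (and of the note about string names for types, e.g. "int" for IntegerType), here is a hedged sketch; every class and method name below is invented for illustration and is not the actual Spark SQL API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical public data type carrying a string name, e.g. "int" for IntegerType.
class DataType {
    private final String simpleName;
    DataType(String simpleName) { this.simpleName = simpleName; }
    String simpleName() { return simpleName; }

    static final DataType INTEGER = new DataType("int");
    static final DataType STRING  = new DataType("string");
}

// A named, typed field inside a struct.
class StructField {
    final String name;
    final DataType dataType;
    StructField(String name, DataType dataType) { this.name = name; this.dataType = dataType; }
}

// The schema of a row, viewable programmatically as a list of fields,
// with the internal details (TypeTags, etc.) hidden behind it.
class StructType {
    final List<StructField> fields = new ArrayList<>();
    StructType add(String name, DataType t) { fields.add(new StructField(name, t)); return this; }
    StructField field(String name) {
        for (StructField f : fields) if (f.name.equals(name)) return f;
        throw new IllegalArgumentException("No field named " + name);
    }
}
```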





[jira] [Created] (SPARK-2227) Support "dfs" command

2014-06-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2227:
--

 Summary: Support "dfs" command
 Key: SPARK-2227
 URL: https://issues.apache.org/jira/browse/SPARK-2227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin


Potentially just delegate to Hive. 





[jira] [Resolved] (SPARK-2061) Deprecate `splits` in JavaRDDLike and add `partitions`

2014-06-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2061.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1062
[https://github.com/apache/spark/pull/1062]

> Deprecate `splits` in JavaRDDLike and add `partitions`
> --
>
> Key: SPARK-2061
> URL: https://issues.apache.org/jira/browse/SPARK-2061
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Reporter: Patrick Wendell
>Assignee: Anant Daksh Asthana
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Most of Spark has moved over to consistently using `partitions` instead of 
> `splits`. We should do likewise and add a `partitions` method to JavaRDDLike 
> and have `splits` just call that. We should also go through all cases where 
> other APIs (e.g. Python) call `splits` and change those to use the 
> newer API.
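The deprecate-and-delegate pattern described above can be sketched as follows; `JavaRDDLikeSketch` and its body are illustrative stand-ins, not the actual JavaRDDLike source:

```java
import java.util.List;

// Sketch: `splits` becomes a deprecated alias that just delegates to the
// new `partitions` method, so existing callers keep working.
interface JavaRDDLikeSketch<P> {
    /** Preferred name, consistent with the rest of Spark. */
    List<P> partitions();

    /** Old name, kept for compatibility; callers should migrate. */
    @Deprecated
    default List<P> splits() {
        return partitions();  // just delegate
    }
}
```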





[jira] [Updated] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough

2014-06-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1777:
-

Attachment: spark-1777-design-doc.pdf

> Pass "cached" blocks directly to disk if memory is not large enough
> ---
>
> Key: SPARK-1777
> URL: https://issues.apache.org/jira/browse/SPARK-1777
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
> Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it 
> will cause us to exceed the storage limit. This has an obvious problem - if 
> the partition itself is enough to push us over the storage limit (and 
> eventually over the JVM heap), it will cause an OOM.
> This can happen in cases where a single partition is very large or when 
> someone is running examples locally with a small heap.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106
> We should think a bit about the most elegant way to fix this - it shares some 
> similarities with the external aggregation code.
> A simple idea is to periodically check the size of the buffer as we are 
> unrolling and see if we are over the memory limit. If we are, we could prepend 
> the existing buffer to the iterator and write that entire thing out to disk.
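That idea can be sketched as follows; the size estimator, the check interval, and the result shape are assumptions for illustration, not the actual CacheManager code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

// Result: either the fully unrolled partition, or an iterator to stream to disk.
class UnrollResult<T> {
    final List<T> inMemory;   // non-null when the partition fit under the limit
    final Iterator<T> toDisk; // non-null when we bailed out (buffer + remainder)
    private UnrollResult(List<T> m, Iterator<T> d) { inMemory = m; toDisk = d; }
    static <T> UnrollResult<T> fits(List<T> m)      { return new UnrollResult<>(m, null); }
    static <T> UnrollResult<T> spill(Iterator<T> d) { return new UnrollResult<>(null, d); }
}

class Unroller {
    static <T> UnrollResult<T> unrollSafely(Iterator<T> iter, ToLongFunction<T> sizeOf,
                                            long memoryLimit, int checkEvery) {
        List<T> buf = new ArrayList<>();
        long estimate = 0L;
        int count = 0;
        while (iter.hasNext()) {
            T elem = iter.next();
            buf.add(elem);
            estimate += sizeOf.applyAsLong(elem);
            // Periodic check: if the estimate exceeds the limit, prepend the
            // buffered elements to the remaining iterator so the caller can
            // write the whole partition to disk instead of OOMing.
            if (++count % checkEvery == 0 && estimate > memoryLimit) {
                Iterator<T> buffered = buf.iterator();
                return UnrollResult.spill(new Iterator<T>() {
                    public boolean hasNext() { return buffered.hasNext() || iter.hasNext(); }
                    public T next() { return buffered.hasNext() ? buffered.next() : iter.next(); }
                });
            }
        }
        return UnrollResult.fits(buf);  // fit entirely; safe to cache in memory
    }
}
```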





[jira] [Resolved] (SPARK-1970) Update unit test in XORShiftRandomSuite to use ChiSquareTest from commons-math3

2014-06-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1970.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1073
[https://github.com/apache/spark/pull/1073]

> Update unit test in XORShiftRandomSuite to use ChiSquareTest from 
> commons-math3
> ---
>
> Key: SPARK-1970
> URL: https://issues.apache.org/jira/browse/SPARK-1970
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Doris Xin
>Assignee: Doris Xin
>Priority: Minor
> Fix For: 1.1.0
>
>
> Since we're adding commons-math3 as a test dependency to Spark core, update 
> the test for randomness in XORShiftRandomSuite to use an actual chi-square 
> test as opposed to hardcoding everything.





[jira] [Resolved] (SPARK-1902) Spark shell prints error when :4040 port already in use

2014-06-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1902.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1019
[https://github.com/apache/spark/pull/1019]

> Spark shell prints error when :4040 port already in use
> ---
>
> Key: SPARK-1902
> URL: https://issues.apache.org/jira/browse/SPARK-1902
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
> Fix For: 1.1.0
>
>
> When running two shells on the same machine, I get the below error.  The 
> issue is that the first shell takes port 4040, then the next tries 4040 
> and fails so falls back to 4041, then a third would try 4040 and 4041 before 
> landing on 4042, etc.
> We should catch the error and instead log it as "Unable to use port 4041; 
> already in use.  Attempting port 4042..."
> {noformat}
> 14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED 
> SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already 
> in use
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:444)
> at sun.nio.ch.Net.bind(Net.java:436)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> at 
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
> at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:217)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
> at $line3.$read$$iwC$$iwC.&lt;init&gt;(&lt;console&gt;:8)
> at $line3.$read$$iwC.&lt;init&gt;(&lt;console&gt;:14)
> at $line3.$read.&lt;init&gt;(&lt;console&gt;:16)
> at $line3.$read$.&lt;init&gt;(&lt;console&gt;:20)
> at $line3.$read$.&lt;clinit&gt;(&lt;console&gt;)
> at $line3.$eval$.&lt;init&gt;(&lt;console&gt;:7)
> at $line3.$eval$.&lt;clinit&gt;(&lt;console&gt;)
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
> at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
> at org.apache.spark.repl.Spar

[jira] [Commented] (SPARK-1652) Fixes and improvements for spark-submit/configs

2014-06-20 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039565#comment-14039565
 ] 

Nicholas Chammas commented on SPARK-1652:
-

[~pwendell], is there currently a work-around for running Python driver 
programs on the cluster?

Trying this in 1.0.0 currently yields a succinct: 
{code}
Error: Cannot currently run Python driver programs on cluster
{code}

I'm not sure if there is a separate issue to track this, or if this is the 
issue I should watch.

> Fixes and improvements for spark-submit/configs
> ---
>
> Key: SPARK-1652
> URL: https://issues.apache.org/jira/browse/SPARK-1652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.1.0
>
>
> These are almost all a result of my config patch. Unfortunately the changes 
> were difficult to unit-test, and there were several edge cases reported.





[jira] [Commented] (SPARK-1856) Standardize MLlib interfaces

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039495#comment-14039495
 ] 

Erik Erlandson commented on SPARK-1856:
---

How does this jira relate (if at all) to the MLI project, part of whose purpose 
is (or was) to provide a unified and pluggable type system for machine learning 
models, their inputs, parameters and training?

http://www.cs.berkeley.edu/~ameet/mlbase_website/mlbase_website/publications.html


> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.1.0
>
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm),  we should standardize MLlib's interfaces that 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.





[jira] [Resolved] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2225.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Reynold Xin  (was: William Benton)

> Turn HAVING without GROUP BY into WHERE
> ---
>
> Key: SPARK-2225
> URL: https://issues.apache.org/jira/browse/SPARK-2225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.0.1, 1.1.0
>
>
> See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx
> The HAVING clause specifies conditions that determine the groups included in 
> the query. If the SQL SELECT statement does not contain aggregate functions, 
> you can use a SQL SELECT statement that contains a HAVING clause without a 
> GROUP BY clause.





[jira] [Updated] (SPARK-2180) Basic HAVING clauses support for HiveQL

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2180:
---

Assignee: William Benton

> Basic HAVING clauses support for HiveQL
> ---
>
> Key: SPARK-2180
> URL: https://issues.apache.org/jira/browse/SPARK-2180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: William Benton
>Assignee: William Benton
>Priority: Minor
>
> The HiveQL implementation doesn't support HAVING clauses for aggregations.  
> This prevents some of the TPCDS benchmarks from running.





[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections

2014-06-20 Thread Charles Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039396#comment-14039396
 ] 

Charles Reiss commented on SPARK-704:
-

It's been a while since I reported this issue, so it may have been incidentally 
fixed.

But this problem was with a remote node failure _after_ a message (or several 
messages) was successfully sent to that node but before a response was 
received. So, there would be no message to send to trigger a failing attempt to 
write to the channel.

If there's a corresponding ReceivingConnection, then the remote node death 
would be detected via a failed read, but I believe the code in 
ConnectionManager#removeConnection would not reliably trigger the 
MessageStatuses.

> ConnectionManager sometimes cannot detect loss of sending connections
> -
>
> Key: SPARK-704
> URL: https://issues.apache.org/jira/browse/SPARK-704
> Project: Spark
>  Issue Type: Bug
>Reporter: Charles Reiss
>Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections 
> disconnect except if it is trying to send through them. As a result, a node 
> failure just after a connection is initiated but before any acknowledgement 
> messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the 
> failure of a corresponding ReceivingConnection, but this code assumes that 
> the remote host:port of the ReceivingConnection is the same as the 
> ConnectionManagerId, which is almost never true. Additionally, there does not 
> appear to be any reason to assume a corresponding ReceivingConnection will 
> exist.





[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039365#comment-14039365
 ] 

Reynold Xin commented on SPARK-2225:


Yea I don't think we need to follow the negative tests (we should make sure we 
follow the positive tests). I will submit a PR and if you can review that it'd 
be great.

> Turn HAVING without GROUP BY into WHERE
> ---
>
> Key: SPARK-2225
> URL: https://issues.apache.org/jira/browse/SPARK-2225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: William Benton
>
> See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx
> The HAVING clause specifies conditions that determine the groups included in 
> the query. If the SQL SELECT statement does not contain aggregate functions, 
> you can use a SQL SELECT statement that contains a HAVING clause without a 
> GROUP BY clause.





[jira] [Updated] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2225:
---

Assignee: William Benton

> Turn HAVING without GROUP BY into WHERE
> ---
>
> Key: SPARK-2225
> URL: https://issues.apache.org/jira/browse/SPARK-2225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: William Benton
>
> See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx
> The HAVING clause specifies conditions that determine the groups included in 
> the query. If the SQL SELECT statement does not contain aggregate functions, 
> you can use a SQL SELECT statement that contains a HAVING clause without a 
> GROUP BY clause.





[jira] [Updated] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2226:
---

Assignee: William Benton  (was: Reynold Xin)

> HAVING should be able to contain aggregate expressions that don't appear in 
> the aggregation list. 
> --
>
> Key: SPARK-2226
> URL: https://issues.apache.org/jira/browse/SPARK-2226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: William Benton
>
> https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
> This test file contains the following query:
> {code}
> SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
> {code}
> Once we've fixed this issue, we should whitelist having.q.







[jira] [Assigned] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-2226:
--

Assignee: Reynold Xin  (was: William Benton)

> HAVING should be able to contain aggregate expressions that don't appear in 
> the aggregation list. 
> --
>
> Key: SPARK-2226
> URL: https://issues.apache.org/jira/browse/SPARK-2226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
> This test file contains the following query:
> {code}
> SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
> {code}
> Once we've fixed this issue, we should whitelist having.q.





[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-20 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039355#comment-14039355
 ] 

Anant Daksh Asthana commented on SPARK-2198:


I am in agreement with Helena.

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition into standard Dependencies, Version, Settings, and Publish.scala 
> files, keeping SparkBuild clean to describe the modules and their deps, so 
> that version changes, for example, need only be made in Version.scala, 
> scalac settings changes in Settings.scala, etc.
> I'd be happy to do this ([~helena_e])





[jira] [Updated] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2226:
---

Assignee: Reynold Xin

> HAVING should be able to contain aggregate expressions that don't appear in 
> the aggregation list. 
> --
>
> Key: SPARK-2226
> URL: https://issues.apache.org/jira/browse/SPARK-2226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
> This test file contains the following query:
> {code}
> SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
> {code}
> Once we've fixed this issue, we should whitelist having.q.





[jira] [Commented] (SPARK-2180) Basic HAVING clauses support for HiveQL

2014-06-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039344#comment-14039344
 ] 

Reynold Xin commented on SPARK-2180:


[~willbenton] I added two related issues. Let me know if you would like to work 
on them. Thanks!


> Basic HAVING clauses support for HiveQL
> ---
>
> Key: SPARK-2180
> URL: https://issues.apache.org/jira/browse/SPARK-2180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: William Benton
>Priority: Minor
>
> The HiveQL implementation doesn't support HAVING clauses for aggregations.  
> This prevents some of the TPCDS benchmarks from running.





[jira] [Created] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-06-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2226:
--

 Summary: HAVING should be able to contain aggregate expressions 
that don't appear in the aggregation list. 
 Key: SPARK-2226
 URL: https://issues.apache.org/jira/browse/SPARK-2226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin


https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q

This test file contains the following query:
{code}
SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
{code}

Once we've fixed this issue, we should whitelist having.q.






[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039335#comment-14039335
 ] 

William Benton commented on SPARK-2225:
---

So the Hive test suite treats HAVING without GROUP BY as an error (see 
having1.q), although it is accepted by some dialects (like SQL Server).  I'm 
happy to take this, though, if this is what we want to do here.

> Turn HAVING without GROUP BY into WHERE
> ---
>
> Key: SPARK-2225
> URL: https://issues.apache.org/jira/browse/SPARK-2225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>
> See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx
> The HAVING clause specifies conditions that determine the groups included in 
> the query. If the SQL SELECT statement does not contain aggregate functions, 
> you can use a SQL SELECT statement that contains a HAVING clause without a 
> GROUP BY clause.





[jira] [Created] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2225:
--

 Summary: Turn HAVING without GROUP BY into WHERE
 Key: SPARK-2225
 URL: https://issues.apache.org/jira/browse/SPARK-2225
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin


See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx

The HAVING clause specifies conditions that determine the groups included in 
the query. If the SQL SELECT statement does not contain aggregate functions, 
you can use a SQL SELECT statement that contains a HAVING clause without a 
GROUP BY clause.






[jira] [Updated] (SPARK-2180) Basic HAVING clauses support for HiveQL

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2180:
---

Summary: Basic HAVING clauses support for HiveQL  (was: HiveQL doesn't 
support GROUP BY with HAVING clauses)

> Basic HAVING clauses support for HiveQL
> ---
>
> Key: SPARK-2180
> URL: https://issues.apache.org/jira/browse/SPARK-2180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: William Benton
>Priority: Minor
>
> The HiveQL implementation doesn't support HAVING clauses for aggregations.  
> This prevents some of the TPCDS benchmarks from running.





[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039256#comment-14039256
 ] 

Erik Erlandson commented on SPARK-1406:
---

How about a PMML pickler/unpickler, written as an extension to:
https://github.com/scala/pickling


> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow analytical models created with a statistical modeling tool 
> like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform 
> the actual model evaluation for a given input tuple. The PMML model would 
> then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator





[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-06-20 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039236#comment-14039236
 ] 

Mridul Muralidharan commented on SPARK-2089:


[~pwendell] SplitInfo is not from Hadoop - it gives locality preference in the 
context of Spark (see org.apache.spark.scheduler.SplitInfo) in a reasonably 
API-agnostic way.
The default support provided for it is Hadoop-specific, based on DFS blocks - 
but I don't think there is anything stopping us from expressing other forms 
(either already, or with minor modifications as applicable).

We actually use that API very heavily - moving 10s or 100s of TB of data tends 
to be fairly expensive :-)
Since we are still stuck on 0.9 + changes, we have not yet faced this issue, 
so it's great to see it being addressed.



> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext is ready.  
> The Spark-YARN code then immediately fetches the preferredNodeLocationData 
> from the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty unset version.  This occurred during all 
> of my runs.
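The race described above can be sketched in a few lines of Python (a
hypothetical stand-in for the Scala SparkContext and Spark-YARN monitor, not
the actual code): the monitor is notified during initialization, before the
preferredNodeLocationData field has been assigned, so an eager reader observes
the empty default.

```python
# Minimal sketch of the initialization race described above; names are
# illustrative stand-ins, not the real Spark classes.
class SparkContextSketch:
    def __init__(self, preferred_node_location_data, notify_monitor):
        self.preferredNodeLocationData = {}        # empty default
        notify_monitor(self)                       # monitor is woken up here...
        # ...but the field is only assigned after the rest of initialization:
        self.preferredNodeLocationData = preferred_node_location_data

observed = []
# The "monitor" immediately fetches the locality data when notified.
ctx = SparkContextSketch(
    {"node1": ["/rack1"]},
    lambda sc: observed.append(dict(sc.preferredNodeLocationData)),
)

assert observed == [{}]        # the monitor saw the empty, unset version
assert ctx.preferredNodeLocationData == {"node1": ["/rack1"]}
```

Moving the assignment before the notification (or notifying only after the
constructor finishes) removes the window.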





[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-20 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039217#comment-14039217
 ] 

Mridul Muralidharan commented on SPARK-2223:


[~tgraves] You could try running zinc - it speeds up the Maven build quite a 
bit. I find that I need to shut down and restart it at times, but otherwise it 
works fine.

> Building and running tests with maven is extremely slow
> ---
>
> Key: SPARK-2223
> URL: https://issues.apache.org/jira/browse/SPARK-2223
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> For some reason using maven with Spark is extremely slow.  Building and 
> running tests takes way longer than other projects I have used that use 
> maven.  We should investigate to see why.  





[jira] [Commented] (SPARK-1528) Spark on Yarn: Add option for user to specify additional namenodes to get tokens from

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039186#comment-14039186
 ] 

Thomas Graves commented on SPARK-1528:
--

https://github.com/apache/spark/pull/1159

> Spark on Yarn: Add option for user to specify additional namenodes to get 
> tokens from
> -
>
> Key: SPARK-1528
> URL: https://issues.apache.org/jira/browse/SPARK-1528
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> Some users running Spark on YARN may wish to contact other HDFS clusters than 
> the one they are running on.  We should add an option for them to specify 
> those namenodes so that we can get the credentials needed for the application 
> to contact them.  





[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors

2014-06-20 Thread Sebastien Rainville (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Rainville updated SPARK-2204:
---

Description: MesosSchedulerBackend.resourceOffers(SchedulerDriver, 
List[Offer]) is assuming that 
TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in 
the same order as the offers it was passed, but in the current implementation 
TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the 
tasks always to the same executors. The result is that the tasks are launched 
on the wrong executors.  (was: 
MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming 
that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists 
in the same order as the offers it was passed, but in the current 
implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
assigning the tasks always to the same executors. The result is that the tasks 
are launched on random executors.)

> Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
> --
>
> Key: SPARK-2204
> URL: https://issues.apache.org/jira/browse/SPARK-2204
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Sebastien Rainville
>Priority: Blocker
>
> MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
> assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
> task lists in the same order as the offers it was passed, but in the current 
> implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
> assigning the tasks always to the same executors. The result is that the 
> tasks are launched on the wrong executors.
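The mismatch is easy to reproduce in a few lines of Python (hypothetical
stand-ins, not the actual Scala code): if the scheduler shuffles the offers
internally, pairing its results positionally with the caller's offer list
assigns tasks to the wrong executors, while keying the results by executor id
stays correct regardless of ordering.

```python
import random

offers = ["executor-A", "executor-B", "executor-C"]

def resource_offers(offer_list):
    """Stand-in for TaskSchedulerImpl.resourceOffers: it processes offers in
    a shuffled order (to spread load), so the returned task list is NOT in
    the caller's offer order."""
    shuffled = offer_list[:]
    random.shuffle(shuffled)
    # One task per offer, tagged with the executor it was scheduled on.
    return [(ex, "task-for-" + ex) for ex in shuffled]

results = resource_offers(offers)

# Buggy pairing (what the bug report describes): zip the results back to the
# offers positionally, assuming the order was preserved.
positional = {offer: task for offer, (_, task) in zip(offers, results)}

# Correct pairing: use the executor id carried with each scheduled task.
by_executor = {ex: task for ex, task in results}

assert all(by_executor[ex] == "task-for-" + ex for ex in offers)
# `positional` is only correct when the shuffle happens to preserve order.
```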





[jira] [Assigned] (SPARK-1528) Spark on Yarn: Add option for user to specify additional namenodes to get tokens from

2014-06-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-1528:


Assignee: Thomas Graves

> Spark on Yarn: Add option for user to specify additional namenodes to get 
> tokens from
> -
>
> Key: SPARK-1528
> URL: https://issues.apache.org/jira/browse/SPARK-1528
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> Some users running Spark on YARN may wish to contact other HDFS clusters than 
> the one they are running on.  We should add an option for them to specify 
> those namenodes so that we can get the credentials needed for the application 
> to contact them.  





[jira] [Resolved] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs

2014-06-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1868.


Resolution: Fixed
  Assignee: Allan Douglas R. de Oliveira

Fixed by this PR:
https://github.com/apache/spark/pull/813

> Users should be allowed to cogroup at least 4 RDDs
> --
>
> Key: SPARK-1868
> URL: https://issues.apache.org/jira/browse/SPARK-1868
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Reporter: Allan Douglas R. de Oliveira
>Assignee: Allan Douglas R. de Oliveira
>
> The cogroup client API currently allows up to 3 RDDs to be cogrouped.
> It's convenient to allow more than this, as cogroup is a very fundamental 
> operation and in the real world we need to group many RDDs together.





[jira] [Updated] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs

2014-06-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1868:
---

Fix Version/s: 1.1.0

> Users should be allowed to cogroup at least 4 RDDs
> --
>
> Key: SPARK-1868
> URL: https://issues.apache.org/jira/browse/SPARK-1868
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Reporter: Allan Douglas R. de Oliveira
>Assignee: Allan Douglas R. de Oliveira
> Fix For: 1.1.0
>
>
> The cogroup client API currently allows up to 3 RDDs to be cogrouped.
> It's convenient to allow more than this, as cogroup is a very fundamental 
> operation and in the real world we need to group many RDDs together.





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039081#comment-14039081
 ] 

Thomas Graves commented on SPARK-2224:
--

Well, that worked in the sense that it actually ran in the yarn submodule; 
unfortunately it didn't actually run any tests (Tests: succeeded 0, failed 0, 
canceled 0, ignored 0, pending 0).  Closing this one since it works for other 
modules; I'll open another one if needed for yarn.



> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Resolved] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2224.
--

Resolution: Invalid

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039076#comment-14039076
 ] 

Thomas Graves commented on SPARK-2223:
--

[~srowen] do you know how long it takes you to do a from-scratch build (just as 
a comparison)?  Maybe there is something messed up in our infrastructure if it 
isn't slow for others.

> Building and running tests with maven is extremely slow
> ---
>
> Key: SPARK-2223
> URL: https://issues.apache.org/jira/browse/SPARK-2223
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> For some reason using maven with Spark is extremely slow.  Building and 
> running tests takes way longer than other projects I have used that use 
> maven.  We should investigate to see why.  





[jira] [Updated] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-20 Thread Suren Hiraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suren Hiraman updated SPARK-2202:
-

Attachment: spark_trace.2.txt
spark_trace.1.txt

Here are two stack traces from the same server. This time around, we ended up 
with 5 stuck threads on 2 different servers.

As a reminder, this is with re-enabling the various custom settings listed 
above.

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
> Attachments: spark_trace.1.txt, spark_trace.2.txt
>
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found





[jira] [Commented] (SPARK-1882) Support dynamic memory sharing in Mesos

2014-06-20 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038987#comment-14038987
 ] 

Andrew Ash commented on SPARK-1882:
---

Yeah, for homogeneous environments, I think you can get full utilization of 
both CPU and memory across the cluster.  It could work like this:

Suppose each machine has 16 cores and 256GB memory, which is 16GB per core.  
Leave {{spark.task.cpus}} at the default of 1 and set {{spark.executor.memory}} 
to 16GB.  Now each task launched grabs one core and 16GB.  Once they're all 
taken on a machine, it's fully maxed out in both memory and CPU.

But if we have heterogeneous machines with different CPU:memory ratios, I think 
we'd be in trouble.  We couldn't pick a ratio that fully utilizes all machines, 
so we'd have either under-utilized CPUs or under-utilized memory for machines 
with low CPU:memory vs high CPU:memory ratios, respectively.

The suggestion of un-coupling cores and memory is a good one -- if each task 
accepted an amount of memory proportional to the remaining memory on the 
machine, then I think you could get good utilization even across heterogeneous 
environments.
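The arithmetic above can be made concrete with a small sketch (illustrative
numbers and function only): with a fixed 1-core / 16 GB task shape, a machine
matching the 16 GB-per-core ratio is fully utilized, while a machine with a
different ratio strands one of the two resources.

```python
def utilization(cores, mem_gb, task_cpus=1, task_mem_gb=16):
    """How many fixed-shape tasks fit on a machine, and what fraction of its
    CPU and memory they use.  Illustrative sketch, not Mesos accounting."""
    n_tasks = min(cores // task_cpus, mem_gb // task_mem_gb)
    return n_tasks, n_tasks * task_cpus / cores, n_tasks * task_mem_gb / mem_gb

# 16 cores, 256 GB: the 16 GB-per-core ratio from the comment. Fully used.
assert utilization(16, 256) == (16, 1.0, 1.0)

# Memory-heavy machine (8 cores, 256 GB): CPUs max out first, so half the
# memory sits idle under the fixed 1-core/16GB task shape.
n_tasks, cpu_frac, mem_frac = utilization(8, 256)
assert (n_tasks, cpu_frac, mem_frac) == (8, 1.0, 0.5)
```

Letting each task request memory proportional to what remains on the machine,
as suggested above, is what breaks this coupling.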

> Support dynamic memory sharing in Mesos
> ---
>
> Key: SPARK-1882
> URL: https://issues.apache.org/jira/browse/SPARK-1882
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> Fine grained mode Mesos currently supports sharing CPUs very well, but 
> requires that memory be pre-partitioned according to the executor memory 
> parameter.  Mesos supports dynamic memory allocation in addition to dynamic 
> CPU allocation, so we should utilize this feature in Spark.
> See below where when the Mesos backend accepts a resource offer it only 
> checks that there's enough memory to cover sc.executorMemory, and doesn't 
> ever take a fraction of the memory available.  The memory offer is accepted 
> all or nothing from a pre-defined parameter.
> Coarse mode:
> https://github.com/apache/spark/blob/3ce526b168050c572a1feee8e0121e1426f7d9ee/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L208
> Fine mode:
> https://github.com/apache/spark/blob/a5150d199ca97ab2992bc2bb221a3ebf3d3450ba/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038981#comment-14038981
 ] 

Thomas Graves commented on SPARK-2224:
--

Oh, I bet I only ran with package and not install.  I'll try that, and if it 
works we can close this.

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038976#comment-14038976
 ] 

Thomas Graves commented on SPARK-2223:
--

Some of our nightly builds are taking 50 minutes without running the tests.  
Many of them do update to latest, but it's still not a total from-scratch 
build.

I also just tried to run a single test inside of yarn, and it took over 20 
minutes for it to even get to that module to try to build - and this was after 
already having built with -DskipTests.  Now when I do this a second time it's 
very quick, but the first time took forever.


> Building and running tests with maven is extremely slow
> ---
>
> Key: SPARK-2223
> URL: https://issues.apache.org/jira/browse/SPARK-2223
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> For some reason using maven with Spark is extremely slow.  Building and 
> running tests takes way longer than other projects I have used that use 
> maven.  We should investigate to see why.  





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038974#comment-14038974
 ] 

Sean Owen commented on SPARK-2224:
--

Ah yes you still need to have built the packages. "mvn -DskipTests install" 
before.

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038971#comment-14038971
 ] 

Thomas Graves commented on SPARK-2224:
--

Thanks, I wasn't aware of -pl.  It doesn't seem to work for yarn though; I get 
this error:
[WARNING] The POM for org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT is 
missing, no dependency information available


-DwildcardSuites still loops through all the modules.

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038964#comment-14038964
 ] 

Guoqiang Li commented on SPARK-2224:


 {{mvn -Phadoop-2.3 -Pyarn  -DwildcardSuites=org.apache.spark.deploy.yarn.* 
test}} ?

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038956#comment-14038956
 ] 

Sean Owen commented on SPARK-2224:
--

mvn test -pl [module], no?

> allow running tests for one sub module
> --
>
> Key: SPARK-2224
> URL: https://issues.apache.org/jira/browse/SPARK-2224
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> We should have a way to run just the unit tests in a submodule (like core or 
> yarn, etc.).
> One way would be to support changing directories into the submodule and 
> running mvn test from there.





[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038944#comment-14038944
 ] 

Sean Owen commented on SPARK-2223:
--

Is it not just because there are a lot of tests, and they generally can't be 
run in parallel? I agree, an hour for tests is really long, but I'm not sure if 
there is a problem except having lots of tests. You can run subsets of tests 
with Maven though to test targeted changes.

> Building and running tests with maven is extremely slow
> ---
>
> Key: SPARK-2223
> URL: https://issues.apache.org/jira/browse/SPARK-2223
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> For some reason using maven with Spark is extremely slow.  Building and 
> running tests takes way longer than other projects I have used that use 
> maven.  We should investigate to see why.  





[jira] [Created] (SPARK-2224) allow running tests for one sub module

2014-06-20 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-2224:


 Summary: allow running tests for one sub module
 Key: SPARK-2224
 URL: https://issues.apache.org/jira/browse/SPARK-2224
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.0
Reporter: Thomas Graves


We should have a way to run just the unit tests in a submodule (like core or 
yarn, etc.).

One way would be to support changing directories into the submodule and running 
mvn test from there.





[jira] [Updated] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int

2014-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2163:
-

Fix Version/s: 1.1.0
   1.0.1

> Set ``setConvergenceTol'' with a parameter of type Double instead of Int
> 
>
> Key: SPARK-2163
> URL: https://issues.apache.org/jira/browse/SPARK-2163
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Gang Bai
> Fix For: 1.0.1, 1.1.0
>
>
> The class LBFGS in mllib.optimization currently provides a 
> {{setConvergenceTol(tolerance: Int)}} method for setting the convergence 
> tolerance. The tolerance parameter is of type {{Int}}. The specified 
> tolerance is then used as parameter in calling {{LBFGS.runLBFGS}}, where the 
> parameter {{convergenceTol}} is of type {{Double}}.
> The Int parameter may cause problems when one creates an optimizer and sets a 
> Double-valued tolerance, e.g.:
> {code:borderStyle=solid}
> override val optimizer = new LBFGS(gradient, updater)
>   .setNumCorrections(9)
>   .setConvergenceTol(1e-4)  // *type mismatch here*
>   .setMaxNumIterations(100)
>   .setRegParam(1.0)
> {code}
> IMHO there is no need to make the tolerance of type Int. Let's change it into 
> a Double parameter and eliminate the type mismatch problem.





[jira] [Created] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-20 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-2223:


 Summary: Building and running tests with maven is extremely slow
 Key: SPARK-2223
 URL: https://issues.apache.org/jira/browse/SPARK-2223
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Thomas Graves


For some reason using maven with Spark is extremely slow.  Building and running 
tests takes way longer than other projects I have used that use maven.  We 
should investigate to see why.  






[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038921#comment-14038921
 ] 

Erik Erlandson commented on SPARK-1493:
---

I submitted a proposed patch for RAT-161, which allows one to request 
path-spanning patterns by including a leading '/'.

If the '--dir' argument is /path/to/repo, and the contents of the '-E' file 
include:
/subpath/to/.*ext

then the pattern induced is:
/path/to/repo + /subpath/to/.*ext --> /path/to/repo/subpath/to/.*ext
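A quick Python sketch of the induced pattern (illustrative only; RAT itself is 
Java, and the function name here is made up):

```python
import re

def induced_pattern(exclude_entry, repo_root="/path/to/repo"):
    """Sketch of the proposed RAT-161 behavior: an exclude entry with a
    leading '/' is anchored at the --dir repo root and may span path
    components; other entries are left as plain file-name patterns."""
    if exclude_entry.startswith("/"):
        return repo_root + exclude_entry
    return exclude_entry

pattern = induced_pattern("/subpath/to/.*ext")
assert pattern == "/path/to/repo/subpath/to/.*ext"
# The anchored pattern matches a full path under the repo root:
assert re.fullmatch(pattern, "/path/to/repo/subpath/to/file.ext")
```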


> Apache RAT excludes don't work with file path (instead of file name)
> 
>
> Key: SPARK-1493
> URL: https://issues.apache.org/jira/browse/SPARK-1493
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>  Labels: starter
>
> Right now the way we do RAT checks, it doesn't work if you try to exclude:
> /path/to/file.ext
> you have to just exclude
> file.ext





[jira] [Created] (SPARK-2222) Add multiclass evaluation metrics

2014-06-20 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-2222:
---

 Summary: Add multiclass evaluation metrics
 Key: SPARK-2222
 URL: https://issues.apache.org/jira/browse/SPARK-2222
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov


There is no class in Spark MLlib for measuring the performance of multiclass 
classifiers. This task involves adding such class and unit tests. The following 
measures are to be implemented: per class, micro averaged and weighted averaged 
Precision, Recall and F1-Measure.





[jira] [Created] (SPARK-2221) Spark fails on windows YARN (HortonWorks 2.4)

2014-06-20 Thread Vlad Rabenok (JIRA)
Vlad Rabenok created SPARK-2221:
---

 Summary: Spark fails on windows YARN (HortonWorks 2.4)
 Key: SPARK-2221
 URL: https://issues.apache.org/jira/browse/SPARK-2221
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
 Environment: Windows 2008 R2
HortonWorks Hadoop cluster 2.4
Spark 1.0
Reporter: Vlad Rabenok


A Windows-based YARN cluster fails to execute a submitted Spark job.
This can be reproduced by submitting JavaSparkPi on a YARN cluster:

C:\hdp\spark-1.0.0-bin-hadoop2\bin>spark-submit --class 
org.apache.spark.examples.JavaSparkPi 
./../lib/spark-examples-1.0.0-hadoop2.2.0.jar --master yarn-cluster 
--deploy-mode cluster

The origin of the problem is 
org.apache.spark.deploy.yarn.ExecutorRunnableUtil.scala: JavaOpts parameters 
that contain environment variables need to be quoted.
- '%' should be escaped, like "kill %%p".  This still doesn't make sense on 
Windows; ideally it should be passed through "spark.executor.extraJavaOptions".
- javaOpts += "-Djava.io.tmpdir=\"" + new Path(Environment.PWD.$(), 
YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR) + "\""






[jira] [Commented] (SPARK-2112) ParquetTypesConverter should not create its own conf

2014-06-20 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038666#comment-14038666
 ] 

Andre Schumacher commented on SPARK-2112:
-

Since commit
https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22
the SparkContext's Hadoop configuration should be used when reading metadata 
from the file source. I wasn't yet able to test this with say S3 bucket names.

Are the S3 credentials copied from SparkConfig to its Hadoop configuration?  If 
someone could confirm this to be working, we could close this issue.

> ParquetTypesConverter should not create its own conf
> 
>
> Key: SPARK-2112
> URL: https://issues.apache.org/jira/browse/SPARK-2112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>
> [~adav]: "this actually makes it so that we can't use S3 credentials set in 
> the SparkContext, or add new FileSystems at runtime, for instance."





[jira] [Commented] (SPARK-2195) Parquet extraMetadata can contain key information

2014-06-20 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038661#comment-14038661
 ] 

Andre Schumacher commented on SPARK-2195:
-

Since commit
https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22
the path is no longer stored in the extraMetadata. So I guess this issue can be 
closed?

> Parquet extraMetadata can contain key information
> -
>
> Key: SPARK-2195
> URL: https://issues.apache.org/jira/browse/SPARK-2195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Priority: Blocker
>
> {code}
> 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: 
> file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet
>  start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same 
> as file fileSchema: message root {
>   optional int32 id;
>   optional binary title;
>   optional int64 modified;
>   optional binary text;
>   optional binary username;
> }
>  extraMetadata: 
> {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
>  StructField(title,StringType,true), StructField(modified,LongType,true), 
> StructField(text,StringType,true), StructField(username,StringType,true))), 
> path= MY AWS KEYS!!! } 
> readSupportMetadata: 
> {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
>  StructField(title,StringType,true), StructField(modified,LongType,true), 
> StructField(text,StringType,true), StructField(username,StringType,true))), 
> path= MY AWS KEYS 
> ***}}
> {code}





[jira] [Resolved] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int

2014-06-20 Thread Gang Bai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Bai resolved SPARK-2163.
-

Resolution: Implemented

> Set ``setConvergenceTol'' with a parameter of type Double instead of Int
> 
>
> Key: SPARK-2163
> URL: https://issues.apache.org/jira/browse/SPARK-2163
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Gang Bai
>
> The class LBFGS in mllib.optimization currently provides a 
> {{setConvergenceTol(tolerance: Int)}} method for setting the convergence 
> tolerance. The tolerance parameter is of type {{Int}}. The specified 
> tolerance is then used as parameter in calling {{LBFGS.runLBFGS}}, where the 
> parameter {{convergenceTol}} is of type {{Double}}.
> The Int parameter may cause a problem when one creates an optimizer and sets 
> a Double-valued tolerance, e.g.:
> {code:borderStyle=solid}
> override val optimizer = new LBFGS(gradient, updater)
>   .setNumCorrections(9)
>   .setConvergenceTol(1e-4)  // *type mismatch here*
>   .setMaxNumIterations(100)
>   .setRegParam(1.0)
> {code}
> IMHO there is no need to make the tolerance of type Int. Let's change it to 
> a Double parameter and eliminate the type mismatch problem.
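A minimal sketch of the proposed fix (not the actual MLlib class; the name LBFGSSketch and the getter are illustrative): the setter takes a Double, matching the convergenceTol parameter that LBFGS.runLBFGS expects, so callers can pass values like 1e-4 directly.

```scala
// Sketch only: illustrates the Double-typed setter shape, not Spark's code.
class LBFGSSketch {
  private var convergenceTol: Double = 1e-4

  // Accepting Double removes the type mismatch when callers write 1e-4.
  def setConvergenceTol(tolerance: Double): this.type = {
    convergenceTol = tolerance
    this
  }

  def getConvergenceTol: Double = convergenceTol
}
```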





[jira] [Created] (SPARK-2220) Fix remaining Hive Commands

2014-06-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2220:
---

 Summary: Fix remaining Hive Commands
 Key: SPARK-2220
 URL: https://issues.apache.org/jira/browse/SPARK-2220
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.1.0


None of the following have an execution plan:
{code}
private[hive] case class DfsCommand(cmd: String) extends Command
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}





[jira] [Created] (SPARK-2219) AddJar doesn't work

2014-06-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2219:
---

 Summary: AddJar doesn't work
 Key: SPARK-2219
 URL: https://issues.apache.org/jira/browse/SPARK-2219
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust








[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-06-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038569#comment-14038569
 ] 

Patrick Wendell commented on SPARK-2089:


Hey Sandy,

I think I removed this from the primary constructor because it was really 
unclear to me what it was for, how we wanted users to use it, and it wasn't 
documented anywhere. So I figured it probably didn't belong in the parent 
constructor that we'll be supporting for the next 2 years :)

My bad for causing this bug though. I'd propose fixing this by just cleaning up 
the interface in the first place and having it so we can encode it in a 
SparkConf.

For one thing, there is no reason we should accept this as a Map[hostname, 
SplitInfo]. SplitInfo is a Hadoop-specific class... this means people can't 
express locality constraints for a YARN Spark job if they are e.g. reading data 
from other storage systems, or they want to express any other type of policy.

I think we should instead just have something like `spark.yarn.preferredNodes` 
which is a comma-separated list of nodes. If we wanted to be a bit fancier, we 
could accept a weight or count for each node.

Then we could offer users a utility function somewhere that, given a 
Map[hostname, SplitInfo], computes the correct value for the SparkConf 
setting. But users who are operating with non-Hadoop storage systems, or who 
just want to express some locality constraints unrelated to storage (e.g., 
put all of my Spark containers on the same rack), could still do it.

One thing is - I actually think this means the current constructor would just 
be broken, so I'm not sure whether to remove it (and break compatibility) or to 
just log an ERROR or something telling users to use the new way.
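To make the proposal concrete, here is a hypothetical helper in the spirit of the comment above. Both the key name `spark.yarn.preferredNodes` and this object are sketches, not real Spark API: it turns per-host split counts into the comma-separated, optionally weighted node list being proposed.

```scala
// Hypothetical utility: compute the proposed "spark.yarn.preferredNodes"
// SparkConf value from per-host split counts. Neither the config key nor
// this helper exists in Spark; this is a sketch of the idea only.
object PreferredNodesSketch {
  def toConfValue(splitsByHost: Map[String, Int]): String =
    splitsByHost.toSeq
      .sortBy(_._1)                                   // deterministic ordering
      .map { case (host, count) => s"$host:$count" }  // host plus a weight
      .mkString(",")
}
```

A caller with non-Hadoop storage could build the map however it likes and pass the result through SparkConf, which is the decoupling the comment argues for.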



> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext has been 
> created.  The Spark-YARN code then immediately fetches the 
> preferredNodeLocationData from the SparkContext and uses it to start 
> requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty unset version.  This occurred during all 
> of my runs.





[jira] [Resolved] (SPARK-2218) rename Equals to EqualTo in Spark SQL expressions

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2218.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> rename Equals to EqualTo in Spark SQL expressions
> -
>
> Key: SPARK-2218
> URL: https://issues.apache.org/jira/browse/SPARK-2218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.0.1, 1.1.0
>
>
> The class name Equals is very error prone because there exists scala.Equals. 
> I just wasted a bunch of time debugging the optimizer because of this.





[jira] [Commented] (SPARK-2114) groupByKey and joins on raw data

2014-06-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038558#comment-14038558
 ] 

Patrick Wendell commented on SPARK-2114:


Hey Sandy,

One thing that would be really helpful is if you could produce a micro 
benchmark that can create the problems you are seeing. That way we could 
profile this and look at the relative benefits of various optimizations. For 
instance, it's possible that simpler things like keeping only the values 
serialized would yield substantial benefit with less complexity than what is 
proposed here.

A second general note is that using a sort-based shuffle would alleviate a lot 
of these issues. At least my guess is that in the MR shuffle, objects are 
sorted in serialized form and only lazily deserialized at the last possible 
second.

> groupByKey and joins on raw data
> 
>
> Key: SPARK-2114
> URL: https://issues.apache.org/jira/browse/SPARK-2114
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>
> For groupByKey and join transformations, Spark tasks on the reduce side 
> deserialize every record into a Java object before calling any user function. 
>  
> This causes all kinds of problems for garbage collection - when aggregating 
> enough data, objects can escape the young gen and trigger full GCs down the 
> line.  Additionally, when records are spilled, they must be serialized and 
> deserialized multiple times.
> It would be helpful to allow aggregations on serialized data - using some 
> sort of RawHasher interface that could implement hashCode and equals for 
> serialized records.  This would also require encoding record boundaries in 
> the serialized format, which I'm not sure we currently do.
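The "RawHasher" idea above could look something like the following sketch. The trait and the byte-wise implementation are hypothetical (nothing by these names exists in Spark): hashing and equality operate directly on serialized bytes, so the reduce side never has to deserialize a record just to group it.

```scala
// Hypothetical interface in the spirit of the proposal: hash and compare
// keys on their serialized byte ranges, avoiding deserialization.
trait RawHasher {
  def hashRaw(bytes: Array[Byte], offset: Int, length: Int): Int
  def equalsRaw(a: Array[Byte], aOff: Int, aLen: Int,
                b: Array[Byte], bOff: Int, bLen: Int): Boolean
}

// Trivial byte-wise implementation, for illustration only. This assumes the
// serialization format is canonical (equal records serialize identically).
object ByteWiseHasher extends RawHasher {
  def hashRaw(bytes: Array[Byte], offset: Int, length: Int): Int = {
    var h = 1
    var i = offset
    while (i < offset + length) { h = 31 * h + bytes(i); i += 1 }
    h
  }

  def equalsRaw(a: Array[Byte], aOff: Int, aLen: Int,
                b: Array[Byte], bOff: Int, bLen: Int): Boolean =
    aLen == bLen && (0 until aLen).forall(i => a(aOff + i) == b(bOff + i))
}
```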





[jira] [Resolved] (SPARK-2209) Cast shouldn't do null check twice

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2209.


Resolution: Fixed

> Cast shouldn't do null check twice
> --
>
> Key: SPARK-2209
> URL: https://issues.apache.org/jira/browse/SPARK-2209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.0.1, 1.1.0
>
>
> Cast does two null checks, one in eval and another one in the function 
> returned by nullOrCast. It's best to get rid of the one in nullOrCast (since 
> eval will be the more common code path).
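The single-null-check shape described above can be sketched as follows. The names here are illustrative, not Spark's actual Cast internals: eval checks for null once, and the cast function can then assume a non-null input.

```scala
// Sketch of the fix's shape: one null check in eval, so the cast
// function (built once, per nullOrCast's role) never re-checks.
def evalCast(input: Any)(castFn: Any => Any): Any =
  if (input == null) null else castFn(input)
```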





[jira] [Resolved] (SPARK-2196) Fix nullability of CaseWhen.

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2196.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Takuya Ueshin

> Fix nullability of CaseWhen.
> 
>
> Key: SPARK-2196
> URL: https://issues.apache.org/jira/browse/SPARK-2196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.0.1, 1.1.0
>
>
> {{CaseWhen}} should use {{branches.length}} to check if {{elseValue}} is 
> provided or not.
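If branches holds alternating condition/value pairs with an optional trailing else (my reading of the description; the real CaseWhen may differ), the check reduces to a parity test on branches.length:

```scala
// Sketch, assuming branches = cond1, val1, cond2, val2, ..., [elseValue]:
// an else value is present exactly when the expression count is odd.
def caseWhenHasElse(numBranchExpressions: Int): Boolean =
  numBranchExpressions % 2 == 1
```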





[jira] [Resolved] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2203.


   Resolution: Fixed
Fix Version/s: 1.1.0

> PySpark does not infer default numPartitions in same way as Spark
> -
>
> Key: SPARK-2203
> URL: https://issues.apache.org/jira/browse/SPARK-2203
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 1.1.0
>
>
> For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), 
> PySpark will always assume that the default parallelism to use for the reduce 
> side is ctx.defaultParallelism, which is a constant typically determined by 
> the number of cores in the cluster.
> In contrast, Spark's Partitioner#defaultPartitioner will use the same number 
> of reduce partitions as map partitions unless the defaultParallelism config 
> is explicitly set. This tends to be a better default in order to avoid OOMs, 
> and should also be the behavior of PySpark.
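The Scala-side behavior the description wants PySpark to match can be sketched as below. The function name and signature are illustrative, not Spark's actual Partitioner.defaultPartitioner: take the largest upstream partition count unless a default parallelism was explicitly configured.

```scala
// Sketch of the described default: explicit spark.default.parallelism wins;
// otherwise inherit the largest upstream (map-side) partition count.
def defaultNumPartitions(upstreamCounts: Seq[Int],
                         explicitParallelism: Option[Int]): Int =
  explicitParallelism.getOrElse(upstreamCounts.max)
```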





[jira] [Resolved] (SPARK-2210) cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)

2014-06-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2210.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
> --
>
> Key: SPARK-2210
> URL: https://issues.apache.org/jira/browse/SPARK-2210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.0.1, 1.1.0
>
>
> {code}
> explain select cast(cast(key=0 as boolean) as boolean) aaa from src
> {code}
> should be
> {code}
> [Physical execution plan:]
> [Project [(key#10:0 = 0) AS aaa#7]]
> [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
> {code}
> However, it is currently
> {code}
> [Physical execution plan:]
> [Project [NOT((key#10=0) = 0) AS aaa#7]]
> [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
> {code}


