[jira] [Created] (SPARK-13713) Replace ANTLR3 SQL parser with an ANTLR4 SQL parser

2016-03-06 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13713:
-

 Summary: Replace ANTLR3 SQL parser with an ANTLR4 SQL parser
 Key: SPARK-13713
 URL: https://issues.apache.org/jira/browse/SPARK-13713
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell


Replace the current SQL parser with a version written in ANTLR4. This has several 
advantages:
* Much simpler structure
* No code blowup
* Reduction in lines of code






[jira] [Commented] (SPARK-13711) Apache Spark driver stopping JVM when master not available

2016-03-06 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182681#comment-15182681
 ] 

Shixiong Zhu commented on SPARK-13711:
--

Sorry, I misread the description. Did you run in client mode? I think 
SparkUncaughtExceptionHandler should not be set for the driver.
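
As the reporter notes, SparkUncaughtExceptionHandler ends up handling the connection failure and calls System.exit, so no user-level loop can keep the JVM alive once a default uncaught-exception handler does that. A minimal, self-contained Scala sketch of the mechanism (this is not Spark's actual code; the exit code 50 simply mirrors the report above):

{code:scala}
object UncaughtExitDemo {
  def main(args: Array[String]): Unit = {
    // Roughly what an uncaught-exception handler that calls System.exit does:
    // any exception escaping any thread terminates the whole JVM.
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in ${t.getName}: ${e.getMessage}")
        System.exit(50) // illustrative; mirrors the exit code in the report
      }
    })

    // Simulate a background thread failing, e.g. because the master is unreachable.
    new Thread(new Runnable {
      override def run(): Unit = throw new RuntimeException("cannot reach master")
    }).start()

    // The "run forever" loop in the reporter's program never gets a chance:
    // the handler above ends the JVM first.
    Thread.sleep(60000)
  }
}
{code}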

> Apache Spark driver stopping JVM when master not available 
> ---
>
> Key: SPARK-13711
> URL: https://issues.apache.org/jira/browse/SPARK-13711
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.6.0
>Reporter: Era
>
> In my application, the Java Spark context is created with an unavailable master 
> URL (you may assume the master is down for maintenance). Creating the Java 
> Spark context then stops the JVM that runs the Spark driver with exit code 50.
> When I checked the logs I found SparkUncaughtExceptionHandler calling 
> System.exit. My program should run forever.
> package test.mains;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
> public class CheckJavaSparkContext {
> /**
>  * @param args the command line arguments
>  */
> public static void main(String[] args) {
> SparkConf conf = new SparkConf();
> conf.setAppName("test");
> conf.setMaster("spark://sunshinee:7077");
> try {
> new JavaSparkContext(conf);
> } catch (Throwable e) {
> System.out.println("Caught an exception : " + e.getMessage());
> //e.printStackTrace();
> }
> System.out.println("Waiting to complete...");
> while (true) {
> }
> }
> }
> Output log
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0
> 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a 
> loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0)
> 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233
> 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233
> 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(ps40233); users 
> with modify permissions: Set(ps40233)
> 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on 
> port 55309.
> 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started
> 16/03/04 18:01:21 INFO Remoting: Starting remoting
> 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128]
> 16/03/04 18:01:22 INFO Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52128.
> 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker
> 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster
> 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98
> 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB
> 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040
> 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master 
> spark://sunshinee:7077...
> 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master 
> sunshinee:7077
> java.io.IOException: Failed to connect to sunshinee:7077
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>

[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182676#comment-15182676
 ] 

Wenchen Fan commented on SPARK-12718:
-

Hi [~smilegator], it seems that I underestimated the difficulty of this job. I 
have a simple PR which works fine for common cases; do you mind taking a look to 
see what's missing? You can send out your PR to explain your approach and how 
you handle special cases.

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Commented] (SPARK-13711) Apache Spark driver stopping JVM when master not available

2016-03-06 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182674#comment-15182674
 ] 

Shixiong Zhu commented on SPARK-13711:
--

This is the correct behavior. The driver needs to talk to the master to request 
and launch executors.

> Apache Spark driver stopping JVM when master not available 
> ---
>
> Key: SPARK-13711
> URL: https://issues.apache.org/jira/browse/SPARK-13711
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.6.0
>Reporter: Era
>
> In my application, the Java Spark context is created with an unavailable master 
> URL (you may assume the master is down for maintenance). Creating the Java 
> Spark context then stops the JVM that runs the Spark driver with exit code 50.
> When I checked the logs I found SparkUncaughtExceptionHandler calling 
> System.exit. My program should run forever.
> package test.mains;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
> public class CheckJavaSparkContext {
> /**
>  * @param args the command line arguments
>  */
> public static void main(String[] args) {
> SparkConf conf = new SparkConf();
> conf.setAppName("test");
> conf.setMaster("spark://sunshinee:7077");
> try {
> new JavaSparkContext(conf);
> } catch (Throwable e) {
> System.out.println("Caught an exception : " + e.getMessage());
> //e.printStackTrace();
> }
> System.out.println("Waiting to complete...");
> while (true) {
> }
> }
> }
> Output log
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0
> 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a 
> loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0)
> 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233
> 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233
> 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(ps40233); users 
> with modify permissions: Set(ps40233)
> 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on 
> port 55309.
> 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started
> 16/03/04 18:01:21 INFO Remoting: Starting remoting
> 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128]
> 16/03/04 18:01:22 INFO Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52128.
> 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker
> 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster
> 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98
> 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB
> 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040
> 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master 
> spark://sunshinee:7077...
> 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master 
> sunshinee:7077
> java.io.IOException: Failed to connect to sunshinee:7077
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at 
> 

[jira] [Assigned] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12718:


Assignee: Xiao Li  (was: Apache Spark)

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Assigned] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12718:


Assignee: Apache Spark  (was: Xiao Li)

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182670#comment-15182670
 ] 

Apache Spark commented on SPARK-12718:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11555

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Assigned] (SPARK-13712) Add OneVsOne to ML

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13712:


Assignee: (was: Apache Spark)

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Another Meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (about 2/K of the total).
> OneVsOne is implemented in the same way as OneVsRest:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)






[jira] [Commented] (SPARK-13712) Add OneVsOne to ML

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182660#comment-15182660
 ] 

Apache Spark commented on SPARK-13712:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11554

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Another Meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (about 2/K of the total).
> OneVsOne is implemented in the same way as OneVsRest:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)






[jira] [Assigned] (SPARK-13712) Add OneVsOne to ML

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13712:


Assignee: Apache Spark

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> Another Meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (about 2/K of the total).
> OneVsOne is implemented in the same way as OneVsRest:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)






[jira] [Created] (SPARK-13712) Add OneVsOne to ML

2016-03-06 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13712:


 Summary: Add OneVsOne to ML
 Key: SPARK-13712
 URL: https://issues.apache.org/jira/browse/SPARK-13712
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: zhengruifeng
Priority: Minor


Another meta method for multi-class classification.

Most classification algorithms were designed for balanced data.
The OneVsRest method will generate K models on imbalanced data.
OneVsOne will train K*(K-1)/2 models on balanced data.

OneVsOne is less sensitive to the problems of imbalanced datasets and can 
usually result in higher precision.
But it is much more computationally expensive, although each model is trained 
on a much smaller dataset (about 2/K of the total).

OneVsOne is implemented in the same way as OneVsRest:

val classifier = new LogisticRegression()
val ovo = new OneVsOne()
ovo.setClassifier(classifier)
val ovoModel = ovo.fit(data)
val predictions = ovoModel.transform(data)
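
To make the pairwise scheme concrete, here is a small self-contained Scala sketch of one-vs-one training and majority voting. It is only an illustration of the idea: the trainer is a trivial stand-in rather than LogisticRegression, and none of the names are a proposed API.

{code:scala}
object OneVsOneSketch {
  type Label = Int
  case class Example(features: Array[Double], label: Label)
  type BinaryModel = Array[Double] => Label

  // Hypothetical binary trainer: decides between classes a and b by comparing
  // a point to the mean of the first feature of each class. A real OneVsOne
  // would plug in the configured classifier here.
  def trainBinary(data: Seq[Example], a: Label, b: Label): BinaryModel = {
    def mean(l: Label): Double = {
      val xs = data.collect { case Example(f, `l`) => f(0) }
      xs.sum / xs.size
    }
    val (ma, mb) = (mean(a), mean(b))
    features => if (math.abs(features(0) - ma) <= math.abs(features(0) - mb)) a else b
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Example(Array(0.1), 0), Example(Array(0.2), 0),
      Example(Array(1.0), 1), Example(Array(1.1), 1),
      Example(Array(2.0), 2), Example(Array(2.2), 2))
    val classes = data.map(_.label).distinct.sorted

    // K * (K - 1) / 2 pairwise models, each fit only on the two classes
    // involved, i.e. roughly 2/K of the data per model.
    val models = for (a <- classes; b <- classes if a < b) yield {
      val subset = data.filter(e => e.label == a || e.label == b)
      trainBinary(subset, a, b)
    }

    // Prediction: every pairwise model votes; the majority class wins.
    def predict(features: Array[Double]): Label =
      models.map(m => m(features)).groupBy(identity).maxBy(_._2.size)._1

    println(predict(Array(1.05))) // expected: 1
  }
}
{code}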






[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182644#comment-15182644
 ] 

Xiao Li commented on SPARK-12718:
-

So far, SQL generation support for window functions works well. However, 
qualifier-related issues break a few test cases.

Because RecoverScopingInfo adds extra subqueries, we need to add a new rule 
after the `Canonicalizer` batch to populate the correct qualifiers for the 
AttributeReferences. I am adding this rule now. Thanks!
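
For illustration only, a tiny self-contained Scala sketch of the idea behind such a rule: once an extra subquery alias has been introduced, attribute references in the enclosing node should pick up that alias as their qualifier so the generated SQL refers to them unambiguously. The case classes are invented stand-ins, not the Catalyst API, and the rule below is only a schematic of the planned one.

{code:scala}
object QualifierSketch {
  // Stand-ins for a logical plan: a subquery alias wraps a child plan, and
  // attribute references may or may not carry the qualifier of that alias.
  sealed trait Plan
  case class Attr(name: String, qualifier: Option[String])
  case class Project(output: Seq[Attr], child: Plan) extends Plan
  case class SubqueryAlias(alias: String, child: Plan) extends Plan
  case object Leaf extends Plan

  // The "rule": rewrite attribute references above a subquery alias to use the
  // alias as their qualifier, e.g. gen_subquery_1.`key` instead of a bare `key`
  // (the alias name is made up for this example).
  def populateQualifiers(plan: Plan): Plan = plan match {
    case Project(output, sub @ SubqueryAlias(alias, _)) =>
      Project(output.map(_.copy(qualifier = Some(alias))), sub)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val before = Project(Seq(Attr("key", None), Attr("value", None)),
                         SubqueryAlias("gen_subquery_1", Leaf))
    println(populateQualifiers(before))
    // Project(List(Attr(key,Some(gen_subquery_1)), Attr(value,Some(gen_subquery_1))),
    //         SubqueryAlias(gen_subquery_1,Leaf))
  }
}
{code}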

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Created] (SPARK-13711) Apache Spark driver stopping JVM when master not available

2016-03-06 Thread Era (JIRA)
Era created SPARK-13711:
---

 Summary: Apache Spark driver stopping JVM when master not 
available 
 Key: SPARK-13711
 URL: https://issues.apache.org/jira/browse/SPARK-13711
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0, 1.4.1
Reporter: Era


In my application, the Java Spark context is created with an unavailable master URL 
(you may assume the master is down for maintenance). Creating the Java Spark 
context then stops the JVM that runs the Spark driver with exit code 50.

When I checked the logs I found SparkUncaughtExceptionHandler calling 
System.exit. My program should run forever.

package test.mains;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckJavaSparkContext {

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {

SparkConf conf = new SparkConf();
conf.setAppName("test");
conf.setMaster("spark://sunshinee:7077");

try {
new JavaSparkContext(conf);
} catch (Throwable e) {
System.out.println("Caught an exception : " + e.getMessage());
//e.printStackTrace();
}

System.out.println("Waiting to complete...");
while (true) {
}
}

}

Output log

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0
16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a 
loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0)
16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233
16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233
16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(ps40233); users 
with modify permissions: Set(ps40233)
16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on 
port 55309.
16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started
16/03/04 18:01:21 INFO Remoting: Starting remoting
16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128]
16/03/04 18:01:22 INFO Utils: Successfully started service 
'sparkDriverActorSystem' on port 52128.
16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker
16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster
16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at 
/tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98
16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB
16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040
16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master 
spark://sunshinee:7077...
16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master 
sunshinee:7077
java.io.IOException: Failed to connect to sunshinee:7077
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at 

[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182584#comment-15182584
 ] 

Xiao Li commented on SPARK-12718:
-

Sure, Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182583#comment-15182583
 ] 

Wenchen Fan commented on SPARK-12718:
-

Then please finish it; we can consolidate them later.

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182578#comment-15182578
 ] 

Xiao Li commented on SPARK-12718:
-

Hi [~cloud_fan], yeah, almost done. Just let me know if I should continue with it. 
Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-06 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182565#comment-15182565
 ] 

Masayoshi TSUZUKI commented on SPARK-13710:
---

It shows a similar ERROR message and stack trace when exiting the Spark 
shell as well.

{noformat}
scala> :quit
16/03/07 13:06:31 INFO SparkUI: Stopped Spark web UI at 
http://192.168.33.129:4040
16/03/07 13:06:31 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
16/03/07 13:06:31 INFO MemoryStore: MemoryStore cleared
16/03/07 13:06:31 INFO BlockManager: BlockManager stopped
16/03/07 13:06:31 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/07 13:06:31 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
16/03/07 13:06:31 INFO SparkContext: Successfully stopped SparkContext
[WARN] Task failed
java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.setConsoleMode(WindowsSupport.java:60)
at 
scala.tools.jline_embedded.WindowsTerminal.setConsoleMode(WindowsTerminal.java:208)
at 
scala.tools.jline_embedded.WindowsTerminal.restore(WindowsTerminal.java:95)
at 
scala.tools.jline_embedded.TerminalSupport$1.run(TerminalSupport.java:52)
at 
scala.tools.jline_embedded.internal.ShutdownHooks.runTasks(ShutdownHooks.java:66)
at 
scala.tools.jline_embedded.internal.ShutdownHooks.access$000(ShutdownHooks.java:22)
at 
scala.tools.jline_embedded.internal.ShutdownHooks$1.run(ShutdownHooks.java:47)

16/03/07 13:06:31 INFO ShutdownHookManager: Shutdown hook called
16/03/07 13:06:31 INFO ShutdownHookManager: Deleting directory 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
16/03/07 13:06:31 ERROR ShutdownHookManager: Exception while deleting Spark 
temp dir: 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
java.io.IOException: Failed to delete: 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:935)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:64)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:61)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:61)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:217)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:189)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:189)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:189)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1788)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:189)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:189)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:189)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:189)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:179)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/03/07 13:06:31 INFO ShutdownHookManager: Deleting directory 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
16/03/07 13:06:31 ERROR ShutdownHookManager: Exception while deleting Spark 
temp dir: 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
java.io.IOException: Failed to delete: 
C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:935)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:64)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:61)
at 

[jira] [Created] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-06 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-13710:
-

 Summary: Spark shell shows ERROR when launching on Windows
 Key: SPARK-13710
 URL: https://issues.apache.org/jira/browse/SPARK-13710
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Reporter: Masayoshi TSUZUKI
Priority: Minor


On Windows, when we launch {{bin\spark-shell.cmd}}, it shows an ERROR message and 
a stack trace.
{noformat}
C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
at 
scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
at 
scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
at 
scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
at 
scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
at 
scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
at 
scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
at 
scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
at 
scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
at 
scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
at 
scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
at scala.util.Try$.apply(Try.scala:192)
at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
at scala.collection.immutable.Stream.collect(Stream.scala:435)
at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at 
scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
at org.apache.spark.repl.Main$.doMain(Main.scala:64)
at org.apache.spark.repl.Main$.main(Main.scala:47)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/03/07 13:05:32 WARN NativeCodeLoader: 

[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182512#comment-15182512
 ] 

Apache Spark commented on SPARK-13600:
--

User 'oliverpierson' has created a pull request for this issue:
https://github.com/apache/spark/pull/11553

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.
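
To make the off-by-one concrete, a small self-contained Scala check (no Spark required) of how splits map to buckets; the {{bucket}} function below just mimics Bucketizer-style interval assignment:

{code:scala}
object BucketCountCheck {
  def main(args: Array[String]): Unit = {
    // The splits reported above for numBuckets = 5.
    val splits = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0,
                       Double.PositiveInfinity)

    // n split points always define n - 1 buckets, so 7 splits mean 6 buckets.
    println(s"buckets = ${splits.length - 1}") // 6, not the requested 5

    // Value x falls into bucket i when splits(i) <= x < splits(i + 1).
    def bucket(x: Double): Int =
      splits.indices.dropRight(1).find(i => splits(i) <= x && x < splits(i + 1)).get

    println((1 to 10).map(i => bucket(i.toDouble)).distinct.sorted)
    // Vector(0, 1, 2, 3, 4, 5) -- six distinct buckets for the sample data
  }
}
{code}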






[jira] [Assigned] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13600:


Assignee: Oliver Pierson  (was: Apache Spark)

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.






[jira] [Assigned] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13600:


Assignee: Apache Spark  (was: Oliver Pierson)

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Apache Spark
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.






[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182496#comment-15182496
 ] 

Wenchen Fan commented on SPARK-12718:
-

Hi [~xiaol], are you still working on it? I was working on it before and 
forgot to update this JIRA...

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.






[jira] [Updated] (SPARK-13709) Spark unable to decode Avro when partitioned

2016-03-06 Thread Chris Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Miller updated SPARK-13709:
-
Description: 
There is a problem decoding Avro data with SparkSQL when the table is partitioned. 
The schema and encoded data are valid -- I'm able to decode the data with the 
avro-tools CLI utility. I'm also able to decode the data with non-partitioned 
SparkSQL tables, Hive, and other tools as well... everything except partitioned 
SparkSQL tables.

For a simple example, I took the example schema and data found in the Oracle 
documentation here:

*Schema*
{code:javascript}
{
  "type": "record",
  "name": "MemberInfo",
  "namespace": "avro",
  "fields": [
{"name": "name", "type": {
  "type": "record",
  "name": "FullName",
  "fields": [
{"name": "first", "type": "string"},
{"name": "last", "type": "string"}
  ]
}},
{"name": "age", "type": "int"},
{"name": "address", "type": {
  "type": "record",
  "name": "Address",
  "fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "zip", "type": "int"}
  ]
}}
  ]
} 
{code}

*Data*
{code:javascript}
{
  "name": {
"first": "Percival",
"last":  "Lowell"
  },
  "age": 156,
  "address": {
"street": "Mars Hill Rd",
"city": "Flagstaff",
"state": "AZ",
"zip": 86001
  }
}
{code}

*Create* (no partitions - works)
If I create with no partitions, I'm able to query the data just fine.

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/data/dir'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');
{code}

*Create* (partitions -- does NOT work)
If I create the table partitioned and then manually add a partition, all of my 
queries return an error. (I need to add partitions manually because I cannot 
control the structure of the data directories, so dynamic partitioning is not 
an option.)

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');

ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir';
{code}

The error:

{code}
spark-sql> SELECT * FROM foo WHERE ds = '1';

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at 

[jira] [Updated] (SPARK-13709) Spark unable to decode Avro when partitioned

2016-03-06 Thread Chris Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Miller updated SPARK-13709:
-
Description: 
There is a problem decoding Avro data with SparkSQL when the table is partitioned. 
The schema and encoded data are valid -- I'm able to decode the data with the 
avro-tools CLI utility. I'm also able to decode the data with non-partitioned 
SparkSQL tables, Hive, and other tools as well... everything except partitioned 
SparkSQL tables.

For a simple example, I took the example schema and data found in the Oracle 
documentation here:

*Schema*
{code:javascript}
{
  "type": "record",
  "name": "MemberInfo",
  "namespace": "avro",
  "fields": [
{"name": "name", "type": {
  "type": "record",
  "name": "FullName",
  "fields": [
{"name": "first", "type": "string"},
{"name": "last", "type": "string"}
  ]
}},
{"name": "age", "type": "int"},
{"name": "address", "type": {
  "type": "record",
  "name": "Address",
  "fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "zip", "type": "int"}
  ]
}}
  ]
} 
{code}

*Data*
{code:javascript}
{
  "name": {
"first": "Percival",
"last":  "Lowell"
  },
  "age": 156,
  "address": {
"street": "Mars Hill Rd",
"city": "Flagstaff",
"state": "AZ",
"zip": 86001
  }
}
{code}

*Create* (no partitions - works)
If I create with no partitions, I'm able to query the data just fine.

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/data/dir'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');
{code}

*Create* (partitions -- does NOT work)
If I create the table partitioned and then manually add a partition, all of my 
queries return an error. (I need to add partitions manually because I cannot 
control the structure of the data directories, so dynamic partitioning is not 
an option.)

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');

ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir';
{code}

The error:

{code}
spark-sql> SELECT * FROM foo WHERE ds = '1';

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at 

[jira] [Created] (SPARK-13709) Spark unable to decode Avro when partitioned

2016-03-06 Thread Chris Miller (JIRA)
Chris Miller created SPARK-13709:


 Summary: Spark unable to decode Avro when partitioned
 Key: SPARK-13709
 URL: https://issues.apache.org/jira/browse/SPARK-13709
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Chris Miller


There is a problem decoding Avro data with SparkSQL when the table is partitioned. 
The schema and encoded data are valid -- I'm able to decode the data with the 
avro-tools CLI utility. I'm also able to decode the data with non-partitioned 
SparkSQL tables, Hive, and other tools as well... everything except partitioned 
SparkSQL tables.

For a simple example, I took the example schema and data found in the Oracle 
documentation here:

**Schema**
{code:javascript}
{
  "type": "record",
  "name": "MemberInfo",
  "namespace": "avro",
  "fields": [
{"name": "name", "type": {
  "type": "record",
  "name": "FullName",
  "fields": [
{"name": "first", "type": "string"},
{"name": "last", "type": "string"}
  ]
}},
{"name": "age", "type": "int"},
{"name": "address", "type": {
  "type": "record",
  "name": "Address",
  "fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "zip", "type": "int"}
  ]
}}
  ]
} 
{code}

**Data**
{code:javascript}
{
  "name": {
"first": "Percival",
"last":  "Lowell"
  },
  "age": 156,
  "address": {
"street": "Mars Hill Rd",
"city": "Flagstaff",
"state": "AZ",
"zip": 86001
  }
}
{code}

**Create** (no partitions - works)
If I create with no partitions, I'm able to query the data just fine.

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/data/dir'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');
{code}

**Create** (partitions -- does NOT work)
If I create the table partitioned and then manually add a partition, all of my 
queries return an error. (I need to add partitions manually because I cannot 
control the structure of the data directories, so dynamic partitioning is not 
an option.)

{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS foo
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc');

ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir';
{code}

The error:

{code}
spark-sql> SELECT * FROM foo WHERE ds = '1';

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at 

[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182482#comment-15182482
 ] 

Dongjoon Hyun commented on SPARK-12243:
---

Here is the real Jenkins test time.
{code}
Tests passed in 804 seconds
{code}

> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-03-06 Thread Taro L. Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182472#comment-15182472
 ] 

Taro L. Saito edited comment on SPARK-5928 at 3/7/16 2:24 AM:
--

FYI. I created LArray library that can handle data larger than 2GB, which is 
the limit of Java byte arrays and mmap files: https://github.com/xerial/larray

It looks like there are several reports showing when this 2GB limit can be 
problematic (especially in processing Spark SQL):
http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29

Let me know if there is anything I can work on. 


was (Author: taroleo):
FYI. I created LArray library that can handle data larger than 2GB, which is 
the limit of Java byte arrays and mmap files: https://github.com/xerial/larray

It looks like there are several reports when this 2GB limit can be problematic 
(especially in processing Spark SQL):
http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29

Let me know if there is anything that I can work on. 

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> 

[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-03-06 Thread Jose Cambronero (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182476#comment-15182476
 ] 

Jose Cambronero commented on SPARK-8884:


[~yuhaoyan] please do! I unfortunately got really busy last semester and this 
one in grad school, and was not able to continue following up on this. 

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>Priority: Minor
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13034) PySpark ml.classification support export/import

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182474#comment-15182474
 ] 

Apache Spark commented on SPARK-13034:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/11552

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers(which have Scala 
> implementation) under pyspark/ml/classification.py. Please refer the 
> implementation at SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13034) PySpark ml.classification support export/import

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13034:


Assignee: (was: Apache Spark)

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers(which have Scala 
> implementation) under pyspark/ml/classification.py. Please refer the 
> implementation at SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13034) PySpark ml.classification support export/import

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13034:


Assignee: Apache Spark

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add export/import for all estimators and transformers(which have Scala 
> implementation) under pyspark/ml/classification.py. Please refer the 
> implementation at SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-03-06 Thread Taro L. Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182472#comment-15182472
 ] 

Taro L. Saito commented on SPARK-5928:
--

FYI. I created LArray library that can handle data larger than 2GB, which is 
the limit of Java byte arrays and mmap files: https://github.com/xerial/larray

It looks like there are several reports when this 2GB limit can be problematic 
(especially in processing Spark SQL):
http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29

Let me know if there is anything that I can work on. 
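
For context, a quick back-of-the-envelope check of the numbers in the stack trace 
above (a sketch; 2147483647 is the signed 32-bit Int.MaxValue that bounds Java 
byte arrays and the Netty frame length):

{code:python}
# The frame size reported in the FetchFailedException above versus the 32-bit cap.
MAX_FRAME = 2**31 - 1        # 2147483647 bytes, i.e. the ~2GB limit being discussed
reported = 3021252889        # "Adjusted frame length exceeds 2147483647: 3021252889"
print(reported > MAX_FRAME)  # True -- hence the TooLongFrameException
{code}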

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> 

[jira] [Comment Edited] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182460#comment-15182460
 ] 

Dongjoon Hyun edited comment on SPARK-12243 at 3/7/16 1:49 AM:
---

According to the log, the total time of all tests is *3077s*. 
So, the minimum required time for 4 processes is *769s*.



was (Author: dongjoon):
According to the log, the total time of all tests are **3077s**. 
So, the minimum required time for 4 processes is **769s**.


> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182460#comment-15182460
 ] 

Dongjoon Hyun commented on SPARK-12243:
---

According to the log, the total time of all tests is **3077s**. 
So, the minimum required time for 4 processes is **769s**.
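
For the arithmetic behind that lower bound:

{code:python}
total_seconds = 3077  # sum of all PySpark test module times from the log
processes = 4         # parallel test processes in the Jenkins run
print(total_seconds / processes)  # 769.25 -- best case with perfect load balancing
{code}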


> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12243:


Assignee: (was: Apache Spark)

> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182455#comment-15182455
 ] 

Apache Spark commented on SPARK-12243:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11551

> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12243:


Assignee: Apache Spark

> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182445#comment-15182445
 ] 

Xiao Li commented on SPARK-12718:
-

In Window Spec, the possible inputs are:

1. partition by + order by, 
2. order by
3. distribute by + sort by
4. sort by
5. cluster by


> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins

2016-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182442#comment-15182442
 ] 

Dongjoon Hyun commented on SPARK-12243:
---

Hi, [~joshrosen].

According to the recent [Running PySpark tests 
log|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console],
 it seems that the long-running tests start at the end due to the FIFO queue. In 
that case, I think we can reduce the test time by simply starting the 
long-running tests first with a simple priority queue. 

{code}
...
Finished test(python3.4): pyspark.streaming.tests (213s)
Finished test(pypy): pyspark.sql.tests (92s)
Finished test(pypy): pyspark.streaming.tests (280s)
Tests passed in 962 seconds
{code}

The long tests are the following:
  * pyspark.tests
  * pyspark.mllib.tests
  * pyspark.streaming.tests

I'll make a PR for this as a first attempt to resolve this JIRA issue.
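
A minimal sketch of the scheduling idea (not the actual python/run-tests.py 
logic; the module names and the choice of a priority queue come from this 
comment, everything else is illustrative):

{code:python}
# Run the known long modules first so the 4 parallel workers finish closer together.
from queue import PriorityQueue

LONG_TESTS = {"pyspark.tests", "pyspark.mllib.tests", "pyspark.streaming.tests"}

def build_queue(modules):
    q = PriorityQueue()
    for name in modules:
        priority = 0 if name in LONG_TESTS else 1  # long-running modules dequeue first
        q.put((priority, name))
    return q

q = build_queue(["pyspark.sql.tests", "pyspark.streaming.tests",
                 "pyspark.tests", "pyspark.mllib.tests"])
while not q.empty():
    _, module = q.get()
    print("launch", module)  # a real runner would hand these to a worker pool
{code}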

> PySpark tests are slow in Jenkins
> -
>
> Key: SPARK-12243
> URL: https://issues.apache.org/jira/browse/SPARK-12243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark, Tests
>Reporter: Josh Rosen
>
> In the Jenkins pull request builder, it looks like PySpark tests take around 
> 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that 
> we run four Python test suites in parallel. We should try to figure out why 
> this is slow and see if there's any easy way to speed things up.
> Note that the PySpark streaming tests take about 5 minutes to run, so 
> best-case we're looking at a 10 minute speedup via further parallelization. 
> We should also try to see whether there are individual slow tests in those 
> Python suites which can be sped up or skipped.
> We could also consider running only the Python 2.6 tests in non-Pyspark pull 
> request builds and reserve testing of all Python versions for builds which 
> touch PySpark-related code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182441#comment-15182441
 ] 

Xiao Li commented on SPARK-12718:
-

If users use CLUSTER BY clauses or DISTRIBUTE BY + SORT BY clauses, the 
generated SQL will convert them to PARTITION BY + ORDER BY. 
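
As an illustration of that conversion (a sketch only; the table {{t}} and the 
columns {{k}}, {{v}} are hypothetical, and the two query strings are expected to 
be equivalent per the comment above):

{code:python}
# The DISTRIBUTE BY + SORT BY window spec and its PARTITION BY + ORDER BY form.
q_hive_style = "SELECT k, v, row_number() OVER (DISTRIBUTE BY k SORT BY v) AS rn FROM t"
q_generated  = "SELECT k, v, row_number() OVER (PARTITION BY k ORDER BY v) AS rn FROM t"
{code}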

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182439#comment-15182439
 ] 

Xiao Li commented on SPARK-12718:
-

We will not add an extra subquery here; we are trying to rebuild the original 
window SQL.

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6162) Handle missing values in GBM

2016-03-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182434#comment-15182434
 ] 

Joseph K. Bradley commented on SPARK-6162:
--

I agree this will be nice to add someday, but it's less pressing than other 
tasks for now.

> Handle missing values in GBM
> 
>
> Key: SPARK-6162
> URL: https://issues.apache.org/jira/browse/SPARK-6162
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Devesh Parekh
>
> We build a lot of predictive models over data combined from multiple sources, 
> where some entries may not have all sources of data and so some values are 
> missing in each feature vector. Another place this might come up is if you 
> have features from slightly heterogeneous items (or items composed of 
> heterogeneous subcomponents) that share many features in common but may have 
> extra features for different types, and you don't want to manually train 
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type 
> of data nicely by making "missing" nodes in the decision tree (a surrogate 
> split) for features that can have missing values. We'd like to do the same 
> with MLlib, but LabeledPoint would need to support missing values, and 
> GradientBoostedTrees would need to be modified to deal with them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12731) PySpark docstring cleanup

2016-03-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-12731.
-
Resolution: Won't Fix

> PySpark docstring cleanup
> -
>
> Key: SPARK-12731
> URL: https://issues.apache.org/jira/browse/SPARK-12731
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> We don't currently have any automated checks that our PySpark docstring lines 
> are within pep8/275/276 length limits (since the pep8 checker doesn't handle 
> this). As such there are ~400 non-conformant docstring lines. This JIRA is to 
> fix those docstring lines and add a command to lint python to fail on long 
> lines.
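
For reference, a minimal sketch of the kind of check described above (not the 
actual dev/lint-python logic; the 79-character limit is an assumption):

{code:python}
# Report (line number, line) pairs that exceed the assumed length limit.
MAX_LEN = 79  # assumed limit; PEP 8 uses 79 for code lines

def long_lines(path):
    with open(path) as f:
        return [(lineno, line.rstrip("\n"))
                for lineno, line in enumerate(f, 1)
                if len(line.rstrip("\n")) > MAX_LEN]
{code}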



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12731) PySpark docstring cleanup

2016-03-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182432#comment-15182432
 ] 

Joseph K. Bradley commented on SPARK-12731:
---

Alright, it sounds like the consensus is that we don't think this is worth 
fixing, so I'll close this.

> PySpark docstring cleanup
> -
>
> Key: SPARK-12731
> URL: https://issues.apache.org/jira/browse/SPARK-12731
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> We don't currently have any automated checks that our PySpark docstring lines 
> are within pep8/275/276 length limits (since the pep8 checker doesn't handle 
> this). As such there are ~400 non-conformant docstring lines. This JIRA is to 
> fix those docstring lines and add a command to lint python to fail on long 
> lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13667) Support for specifying custom date format for date and timestamp types

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13667:


Assignee: Apache Spark

> Support for specifying custom date format for date and timestamp types
> --
>
> Key: SPARK-13667
> URL: https://issues.apache.org/jira/browse/SPARK-13667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the CSV data source does not support parsing date and timestamp 
> types in a custom format, nor inferring the timestamp type from a custom format.
> It looks like quite a few users want this feature. It would be great to be able 
> to set a custom date format.
> This was reported in spark-csv.
> https://github.com/databricks/spark-csv/issues/279
> https://github.com/databricks/spark-csv/issues/262
> https://github.com/databricks/spark-csv/issues/266
> Currently I submitted a PR for this in spark-csv
> https://github.com/databricks/spark-csv/pull/280
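
A usage sketch of what such an option could look like (hypothetical: the option 
name {{dateFormat}} follows the spark-csv pull request linked above and is not a 
released API here; the path and pattern are illustrative):

{code:python}
# Hypothetical: read a CSV with a user-specified timestamp format.
df = (sqlContext.read
      .format("com.databricks.spark.csv")             # spark-csv data source package
      .option("header", "true")
      .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")  # assumed option name from the linked PR
      .load("/path/to/data.csv"))
{code}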



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13667) Support for specifying custom date format for date and timestamp types

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13667:


Assignee: (was: Apache Spark)

> Support for specifying custom date format for date and timestamp types
> --
>
> Key: SPARK-13667
> URL: https://issues.apache.org/jira/browse/SPARK-13667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the CSV data source does not support parsing date and timestamp 
> types in a custom format, nor inferring the timestamp type from a custom format.
> It looks like quite a few users want this feature. It would be great to be able 
> to set a custom date format.
> This was reported in spark-csv.
> https://github.com/databricks/spark-csv/issues/279
> https://github.com/databricks/spark-csv/issues/262
> https://github.com/databricks/spark-csv/issues/266
> Currently I submitted a PR for this in spark-csv
> https://github.com/databricks/spark-csv/pull/280



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13667) Support for specifying custom date format for date and timestamp types

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182431#comment-15182431
 ] 

Apache Spark commented on SPARK-13667:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11550

> Support for specifying custom date format for date and timestamp types
> --
>
> Key: SPARK-13667
> URL: https://issues.apache.org/jira/browse/SPARK-13667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the CSV data source does not support parsing date and timestamp 
> types in a custom format, nor inferring the timestamp type from a custom format.
> It looks like quite a few users want this feature. It would be great to be able 
> to set a custom date format.
> This was reported in spark-csv.
> https://github.com/databricks/spark-csv/issues/279
> https://github.com/databricks/spark-csv/issues/262
> https://github.com/databricks/spark-csv/issues/266
> Currently I submitted a PR for this in spark-csv
> https://github.com/databricks/spark-csv/pull/280



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-03-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182420#comment-15182420
 ] 

yuhao yang commented on SPARK-8884:
---

Hi [~josepablocam]. Do you mind if I continue to work on this? I think this is 
well-written, yet I might need to start another PR to finish it. Let me know if 
you still plan to work on it. Thanks.

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>Priority: Minor
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12566:


Assignee: Apache Spark  (was: yuhao yang)

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12566:


Assignee: yuhao yang  (was: Apache Spark)

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182401#comment-15182401
 ] 

Apache Spark commented on SPARK-12566:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/11549

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-03-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182384#comment-15182384
 ] 

yuhao yang commented on SPARK-12566:


Since we already have a glm in SparkR which is based on LogisticRegressionModel 
and LinearRegressionModel, there are three ways to extend it as I understand it:
1. Change the current glm to use GeneralizedLinearRegression, create another lm 
interface for SparkR, and use LR as the model. 
2. Keep the glm R interface and replace its implementation with GLM. This means R 
can no longer invoke LR.
3. Keep the glm R interface, and combine the implementation of both LR and GLM, 
choosing between them based on a solver parameter.

I'd prefer option 1. I'm going to send one PR (WIP) for solution 2, 
which can later be adjusted to 1 or 3.


> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13496) Optimizing count distinct changes the resulting column name

2016-03-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182332#comment-15182332
 ] 

Ryan Blue commented on SPARK-13496:
---

I wouldn't say this is a duplicate, though I'm fine with provisionally closing 
this. SPARK-12593 is for generating SQL from logical plans and this is a bug 
report.

> Optimizing count distinct changes the resulting column name
> ---
>
> Key: SPARK-13496
> URL: https://issues.apache.org/jira/browse/SPARK-13496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> SPARK-9241 updated the optimizer to rewrite count distinct. That change uses 
> a count that is no longer distinct because duplicates are eliminated further 
> down in the plan. This caused the name of the column to change:
> {code:title=Spark 1.5.2}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], 
> functions=[(count(a#7),mode=Complete,isDistinct=true)], 
> output=[COUNT(DISTINCT a)#9L])
>  TungstenAggregate(key=[a#7], functions=[], output=[a#7])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[a#7], functions=[], output=[a#7])
> LocalTableScan [a#7], [[1]]
> {code}
> {code:title=Spark 1.6.0}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [count(a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Final,isDistinct=false)], output=[count(a)#31L])
> +- TungstenExchange SinglePartition, None
>+- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Partial,isDistinct=false)], output=[count#39L])
>   +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>  +- TungstenExchange hashpartitioning(a#36,gid#35,500), None
> +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>+- Expand [List(a#29, 1)], [a#36,gid#35]
>   +- LocalTableScan [a#29], [[1]]
> {code}
> This has broken jobs that used the generated name. For example, 
> {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}.
> I think that the previous generated name is correct, even though the plan has 
> changed.
> [~marmbrus], you may want to take a look. It looks like you reviewed 
> SPARK-9241 and have some context here.
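
One way for callers to insulate themselves from the generated column name (a 
PySpark sketch against the 1.6 shell's {{sqlContext}}; this is a workaround for 
code like the {{withColumnRenamed}} call quoted above, not a fix for the name 
change itself):

{code:python}
from pyspark.sql.functions import countDistinct

df = sqlContext.createDataFrame([(1, "s")], ["a", "b"])
# Alias the aggregate explicitly so downstream code never depends on the generated name.
out = df.agg(countDistinct("a").alias("c"))
out.show()
{code}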



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-13620:
---

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13609) Support Column Pruning for MapPartitions

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13609:
--
Assignee: Xiao Li

> Support Column Pruning for MapPartitions
> 
>
> Key: SPARK-13609
> URL: https://issues.apache.org/jira/browse/SPARK-13609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> {code}
> case class OtherTuple(_1: String, _2: Int)
> val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS()
> ds.as[OtherTuple].map(identity[OtherTuple]).explain(true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13647) also check if numeric value is within allowed range in _verify_type

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13647:
--
Assignee: Wenchen Fan

> also check if numeric value is within allowed range in _verify_type
> ---
>
> Key: SPARK-13647
> URL: https://issues.apache.org/jira/browse/SPARK-13647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13620.
---
   Resolution: Not A Problem
Fix Version/s: (was: 2.0.0)

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.
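A minimal sketch of the kind of change proposed above (an assumption about the eventual 
patch, not the patch itself; it uses Jetty 8's {{org.eclipse.jetty.server.Server}} API, 
which Spark ships in shaded form):

{code}
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// Passing the literal "0.0.0.0" makes Jetty call getHostName, which may block on reverse DNS.
val slow = new Server(new InetSocketAddress("0.0.0.0", 8000))

// Passing only the port binds to InetAddress.anyLocalAddress(), so there is nothing to resolve.
val fast = new Server(new InetSocketAddress(8000))
{code}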



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13598) Remove LeftSemiJoinBNL

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13598:
--
Assignee: Davies Liu

> Remove LeftSemiJoinBNL
> --
>
> Key: SPARK-13598
> URL: https://issues.apache.org/jira/browse/SPARK-13598
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Broadcast left semi join without join keys is already supported by 
> BroadcastNestedLoopJoin, and it has the same implementation as LeftSemiJoinBNL, 
> so we should remove the latter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13574) Improve parquet dictionary decoding for strings

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13574:
--
Assignee: Nong Li

> Improve parquet dictionary decoding for strings
> ---
>
> Key: SPARK-13574
> URL: https://issues.apache.org/jira/browse/SPARK-13574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, the parquet reader copies the dictionary value for each data 
> value. This is bad for string columns, as we explode the dictionary during 
> decode. We should instead have the data values point to the safe backing 
> memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13630:
--
 Priority: Minor  (was: Major)
Fix Version/s: (was: 2.0.0)

> Add optimizer rule to collapse sorts
> 
>
> Key: SPARK-13630
> URL: https://issues.apache.org/jira/browse/SPARK-13630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunitha Kambhampati
>Priority: Minor
>
> It is possible to collapse adjacent sorts and keep only the last one. This 
> task is to add an optimizer rule that collapses adjacent sorts where possible. 
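A minimal sketch of what such a rule could look like (an assumption, not the actual 
patch), using Catalyst's {{Rule}} and {{Sort}} operators: when a global sort is the 
direct child of another global sort, only the outer ordering matters, so the inner 
sort can be dropped.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: keep the outer sort and drop an adjacent inner global sort,
// since the outer ordering overrides it anyway.
object CollapseSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case outer @ Sort(_, true, Sort(_, true, grandChild)) =>
      outer.copy(child = grandChild)
  }
}
{code}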



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13255:
--
Assignee: Nong Li

> Integrate vectorized parquet scan with whole stage codegen.
> ---
>
> Key: SPARK-13255
> URL: https://issues.apache.org/jira/browse/SPARK-13255
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> The generated whole stage codegen is intended to be run over batches of rows. 
> This task is to integrate ColumnarBatches with whole stage codegen.
> The resulting generated code should look something like:
> {code}
> Iterator<ColumnarBatch> input;
> void process() {
>   while (input.hasNext()) {
>     ColumnarBatch batch = input.next();
>     for (Row row: batch) {
>       // Current function
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13685:
--
Fix Version/s: (was: 2.0.0)

> Rename catalog.Catalog to ExternalCatalog
> -
>
> Key: SPARK-13685
> URL: https://issues.apache.org/jira/browse/SPARK-13685
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13677:
--
Component/s: MLlib

> Support Tree-Based Feature Transformation for mllib
> ---
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed, arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in mllib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)
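A rough sketch of the one-hot step described above (an assumption built on the proposed 
{{leaf}} method, which does not exist in MLlib today; {{oneHotLeaves}} and 
{{leavesPerTree}} are hypothetical names introduced only for illustration):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Sketch only: leafIndices holds one leaf index per tree for each instance
// (as the proposed leaf(...) method would return), and leavesPerTree gives the
// number of leaves in each tree. Each (tree, leaf) pair maps to a fixed slot
// in a single sparse one-hot vector.
def oneHotLeaves(leafIndices: RDD[Vector], leavesPerTree: Array[Int]): RDD[Vector] = {
  val offsets = leavesPerTree.scanLeft(0)(_ + _)  // start position of each tree's block
  val totalSize = offsets.last
  leafIndices.map { v =>
    val positions = v.toArray.zipWithIndex.map { case (leaf, tree) =>
      offsets(tree) + leaf.toInt
    }
    Vectors.sparse(totalSize, positions, Array.fill(positions.length)(1.0))
  }
}
{code}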



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13668:
--
Component/s: SQL

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>
> If a filter predicate or a join condition contains `IsNotNull` checks, we 
> should reorder these checks so that the non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite it as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]
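A rough sketch of the reordering itself (an assumption about how the eventual rule 
might look, not the actual change): split the conjunction, pull the {{IsNotNull}} 
checks out, and rebuild the predicate with them in front.

{code}
import org.apache.spark.sql.catalyst.expressions.{And, Expression, IsNotNull}

// Sketch only: move IsNotNull checks to the front of a conjunctive predicate
// so they are evaluated (and can short-circuit) before the other conditions.
def reorderNotNullChecks(condition: Expression): Expression = {
  def split(e: Expression): Seq[Expression] = e match {
    case And(left, right) => split(left) ++ split(right)
    case other            => Seq(other)
  }
  val (notNullChecks, rest) = split(condition).partition(_.isInstanceOf[IsNotNull])
  (notNullChecks ++ rest).reduceLeft((l, r) => And(l, r))
}
{code}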



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13668:
--
Priority: Minor  (was: Major)

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
>
> If a filter predicate or a join condition contains `IsNotNull` checks, we 
> should reorder these checks so that the non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite it as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13702:
--
Component/s: Examples

> Use diamond operator for generic instance creation in Java code
> ---
>
> Key: SPARK-13702
> URL: https://issues.apache.org/jira/browse/SPARK-13702
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Java 7 and higher support the `diamond` operator, which replaces the type 
> arguments required to invoke the constructor of a generic class with an empty 
> set of type parameters (<>). Currently, Spark's Java code uses this 
> inconsistently. This issue changes the existing code to use the `diamond` 
> operator and adds a Checkstyle rule.
> {code}
> -List<JavaPairDStream<String, String>> kafkaStreams = new 
> ArrayList<JavaPairDStream<String, String>>(numStreams);
> +List<JavaPairDStream<String, String>> kafkaStreams = new 
> ArrayList<>(numStreams);
> {code}
> {code}
> -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, 
> Integer>>(numEdges);
> +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
> {code}
> *Reference*
> https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13648:
--
   Priority: Minor  (was: Major)
Component/s: SQL
Summary: org.apache.spark.sql.hive.client.VersionsSuite fails 
NoClassDefFoundError on IBM JDK  (was: 
org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError)

> org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on 
> IBM JDK
> 
>
> Key: SPARK-13648
> URL: https://issues.apache.org/jira/browse/SPARK-13648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Fails on vendor specific JVMs ( e.g IBM JVM )
>Reporter: Tim Preece
>Priority: Minor
>
> When running the standard Spark unit tests on the IBM Java SDK, the Hive 
> VersionsSuite fails with the following error:
> java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState 
> when creating Hive client using classpath: ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13606:
--
Component/s: PySpark

> Error from python worker:   /usr/local/bin/python2.7: undefined symbol: 
> _PyCodec_LookupTextEncoding
> ---
>
> Key: SPARK-13606
> URL: https://issues.apache.org/jira/browse/SPARK-13606
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Avatar Zhang
>
> Error from python worker:
>   /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: 
> undefined symbol: _PyCodec_LookupTextEncoding
> PYTHONPATH was:
>   
> /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13679) Pyspark job fails with Oozie

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13679:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)

> Pyspark job fails with Oozie
> 
>
> Key: SPARK-13679
> URL: https://issues.apache.org/jira/browse/SPARK-13679
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit, YARN
>Affects Versions: 1.6.0
> Environment: Hadoop 2.7.2, Spark 1.6.0 on Yarn, Oozie 4.2.0
> Cluster secured with Kerberos
>Reporter: Alexandre Linte
>Priority: Minor
>
> Hello,
> I'm trying to run the pi.py example in a PySpark job with Oozie. Every try I made 
> failed for the same reason: key not found: SPARK_HOME. 
> Note: A Scala job works well in this environment with Oozie. 
> The logs on the executors are:
> {noformat}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/hd4/hadoop/yarn/local/filecache/145/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/mnt/hd2/hadoop/yarn/local/filecache/155/spark-assembly-1.6.0-hadoop2.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/application/Hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: 
> /mnt/hd7/hadoop/yarn/log/application_1454673025841_13136/container_1454673025841_13136_01_01
>  (Is a directory)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at java.io.FileOutputStream.(FileOutputStream.java:142)
> at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
> at 
> org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
> at 
> org.apache.hadoop.yarn.ContainerLogAppender.activateOptions(ContainerLogAppender.java:55)
> at 
> org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
> at 
> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
> at 
> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
> at 
> org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809)
> at 
> org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735)
> at 
> org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at 
> org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
> at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:285)
> at 
> org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
> at 
> org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132)
> at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:275)
> at 
> org.apache.hadoop.service.AbstractService.(AbstractService.java:43)
> Using properties file: null
> Parsed arguments:
>   master  yarn-master
>   deployMode  cluster
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  null
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   null
>   primaryResource 
> hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py
>   namePysparkpi example
>   childArgs   [100]
>   jarsnull
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file null:
>   spark.executorEnv.SPARK_HOME -> /opt/application/Spark/current
>   

[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-06 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13692:
--
Description: 
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.
  * Remove unused imports in `ColumnarBatch`.
  * Remove unused fields in `ChunkFetchIntegrationSuite`.
  * Add `stop()` to prevent resource leak.


Please note that the last two checkstyle errors exist on newly added commits 
after [SPARK-13583].

  was:
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.
  * Remove unused imports in `ColumnarBatch`.
  * Remove unused fields in `ChunkFetchIntegrationSuite`.
  * Add `close()` to prevent resource leak.


Please note that the last two checkstyle errors exist on newly added commits 
after [SPARK-13583].


> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
>   * Remove unused fields in `ChunkFetchIntegrationSuite`.
>   * Add `stop()` to prevent resource leak.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13438) Remove by default dash from output paths

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13438.
---
Resolution: Won't Fix

> Remove by default dash from output paths
> 
>
> Key: SPARK-13438
> URL: https://issues.apache.org/jira/browse/SPARK-13438
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Peter Ableda
>
> The current implementation uses the following naming scheme for the saveAsTextFiles, 
> saveAsObjectFiles and saveAsHadoopFiles methods:
> {code}
> "prefix-TIME_IN_MS[.suffix]"
> {code}
> Spark generates the following with a *"/data/timestamp="* prefix.
> {code}
> /data/timestamp=-1455894364000/_SUCCESS
> /data/timestamp=-1455894364000/part-0
> {code}
> I would like to remove the dash from the path to be able to create proper 
> partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13688:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>Priority: Minor
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13688.
---
Resolution: Won't Fix

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>Priority: Minor
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13599) Groovy-all ends up in spark-assembly if hive profile set

2016-03-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182299#comment-15182299
 ] 

Sean Owen commented on SPARK-13599:
---

[~rxin] do you object to me back-porting this to 1.6? I didn't see a resolution 
on the private@ thread about this either way. 

> Groovy-all ends up in spark-assembly if hive profile set
> 
>
> Key: SPARK-13599
> URL: https://issues.apache.org/jira/browse/SPARK-13599
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you do a build with {{-Phive,hive-thriftserver}}, then the contents of 
> {{org.codehaus.groovy:groovy-all}} get into the spark-assembly.jar.
> This is bad because:
> * it makes the JAR bigger
> * it makes the build longer
> * it's an uber-JAR itself, so it can include things (maybe even conflicting 
> things)
> * it's something else that needs to be kept up to date security-wise



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13496) Optimizing count distinct changes the resulting column name

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13496.
---
Resolution: Duplicate

Provisionally labeling this a duplicate then

> Optimizing count distinct changes the resulting column name
> ---
>
> Key: SPARK-13496
> URL: https://issues.apache.org/jira/browse/SPARK-13496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> SPARK-9241 updated the optimizer to rewrite count distinct. That change uses 
> a count that is no longer distinct because duplicates are eliminated further 
> down in the plan. This caused the name of the column to change:
> {code:title=Spark 1.5.2}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], 
> functions=[(count(a#7),mode=Complete,isDistinct=true)], 
> output=[COUNT(DISTINCT a)#9L])
>  TungstenAggregate(key=[a#7], functions=[], output=[a#7])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[a#7], functions=[], output=[a#7])
> LocalTableScan [a#7], [[1]]
> {code}
> {code:title=Spark 1.6.0}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [count(a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Final,isDistinct=false)], output=[count(a)#31L])
> +- TungstenExchange SinglePartition, None
>+- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Partial,isDistinct=false)], output=[count#39L])
>   +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>  +- TungstenExchange hashpartitioning(a#36,gid#35,500), None
> +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>+- Expand [List(a#29, 1)], [a#36,gid#35]
>   +- LocalTableScan [a#29], [[1]]
> {code}
> This has broken jobs that used the generated name. For example, 
> {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}.
> I think that the previous generated name is correct, even though the plan has 
> changed.
> [~marmbrus], you may want to take a look. It looks like you reviewed 
> SPARK-9241 and have some context here.
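A workaround sketch for jobs hit by this (an assumption, not anything recorded in the 
ticket's resolution): alias the aggregate explicitly so the code never depends on the 
generated column name. It assumes the usual spark-shell / {{sqlContext}} implicits for 
{{toDF}} are in scope.

{code}
import org.apache.spark.sql.functions.countDistinct

// Sketch only: an explicit alias keeps the column name stable across Spark versions.
val df = Seq((1, "s")).toDF("a", "b")
val counted = df.agg(countDistinct("a").as("c"))
counted.select("c").show()
{code}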



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13705) UpdateStateByKey Operation documentation incorrectly refers to StatefulNetworkWordCount

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13705:
--
Target Version/s:   (was: 1.6.0)
Priority: Trivial  (was: Minor)
   Fix Version/s: (was: 1.6.0)

Don't set Fix, Target versions

> UpdateStateByKey Operation documentation incorrectly refers to 
> StatefulNetworkWordCount
> ---
>
> Key: SPARK-13705
> URL: https://issues.apache.org/jira/browse/SPARK-13705
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Rishi
>Priority: Trivial
>  Labels: documentation
>
> Doc issue: the Apache Spark Streaming guide elaborates on the UpdateStateByKey 
> operation and goes on to give an example. However, for the complete code sample it 
> refers to 
> {{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache
> /spark/examples/streaming/StatefulNetworkWordCount.scala). 
> StatefulNetworkWordCount.scala has changed to demonstrate the more recent 
> mapWithState API.
> This creates confusion in the document. 
> Until a more detailed explanation of mapWithState is added, the 
> reference to StatefulNetworkWordCount.scala should be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13230.
---
Resolution: Not A Problem

OK, considering this not a problem (in Spark) for now, with the noted 
workaround.
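
For readers landing here, a sketch of one way to sidestep the problem (an assumption, 
not necessarily the workaround referred to above): combine the maps with a plain fold 
instead of {{HashMap.merged}}, so no merge function is captured at all.

{code}
import scala.collection.immutable.HashMap

// Sketch only: sum the values for keys present in both maps without HashMap.merged.
def mergeFn: (HashMap[String, Long], HashMap[String, Long]) => HashMap[String, Long] =
  (m1, m2) => m2.foldLeft(m1) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0L) + v)
  }
{code}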

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13700.
---
Resolution: Not A Problem

I think this might be best to float on a list first, since I don't think this 
would be implemented. 

Some things like this already exist in {{AsyncRDDActions}}. I have the 
impression they're not quite deprecated, but also not something that is going to 
be added to. The semantics are tough to get right.

However, you seem to be talking about something else, where you make N 
expensive synchronous calls in a map operation. You can already do this 
yourself with mapPartitions and a thread pool, as in the sketch below.
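
A minimal sketch of that pattern (assumptions: an already-defined {{RDD[Int]}} named 
{{rdd}}, and a placeholder {{slowLookup}} standing in for the expensive call):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def slowLookup(x: Int): Int = x * 2  // placeholder for a database or web-service call

// Sketch only: rdd is assumed to be an existing org.apache.spark.rdd.RDD[Int].
val transformed = rdd.mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(8)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  // Submit every call up front so the partition's lookups overlap, then collect results.
  val futures = iter.map(x => Future(slowLookup(x))).toList
  val results = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  results.iterator
}
{code}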

> Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation
> -
>
> Key: SPARK-13700
> URL: https://issues.apache.org/jira/browse/SPARK-13700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paulo Costa
>Priority: Minor
>  Labels: async, features, rdd, transform
>
> Spark is great for synchronous operations.
> But sometimes I need to call a database/web server/etc from my transform, and 
> the Spark pipeline stalls waiting for it.
> Avoiding that would be great!
> I suggest we add a new method RDD.mapAsync(), which can execute these 
> operations concurrently, avoiding the bottleneck.
> I've written a quick'n'dirty implementation of what I have in mind: 
> https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
> What do you think?
> If you agree with this feature, I can work on a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13703) Remove obsolete scala-2.10 source files

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13703.
---
Resolution: Not A Problem

Yea, for the moment 2.10 support has not been dropped. Dropping it would indeed 
mean removing files like this.

> Remove obsolete scala-2.10 source files
> ---
>
> Key: SPARK-13703
> URL: https://issues.apache.org/jira/browse/SPARK-13703
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13701.
---
Resolution: Not A Problem

Yeah, I don't think Spark works on arm64 because jblas does not. The thing I'm 
not quite sure about is why it doesn't fail earlier and fall back to Java.  On 
both counts, it sounds like something considered not an issue for Spark.

However... jblas hasn't been used in Spark for a while. What version is this? 
I'm closing this, but you can reply.

> MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm))
> --
>
> Key: SPARK-13701
> URL: https://issues.apache.org/jira/browse/SPARK-13701
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
> Environment: Ubuntu 14.04 on aarch64
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: arm64, porting
>
> jblas fails on arm64.
> {code}
> ALSSuite:
> Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 
> milliseconds)
>   java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
>   at org.jblas.NativeBlas.dgemm(Native Method)
>   at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
>   at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781)
>   at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138)
>   at 
> org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13708) Null pointer Exception while starting spark shell

2016-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13708.
---
Resolution: Invalid

Please ask questions on u...@spark.apache.org
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

You'd have to say a lot more like how you built Spark, how you're running it, 
etc.

> Null pointer Exception while starting spark shell
> -
>
> Key: SPARK-13708
> URL: https://issues.apache.org/jira/browse/SPARK-13708
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.6.0
> Environment: windows platform
>Reporter: Sowmya Dureddy
>
> 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in 
> CLASSPATH (or one of dependencies)
> 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in 
> CLASSPATH (or one of dependencies)
> 16/03/06 11:49:24 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/03/06 11:49:24 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> 16/03/06 11:49:25 WARN : Your hostname, LAPTOP-B9GV6F82 resolves to a 
> loopback/non-reachable address: fe80:0:0:0:1cbd:55b:b3e4:8e04%12, but we 
> couldn't find any external IP address!
> java.lang.RuntimeException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.<init>(<console>:9)
> at $iwC.<init>(<console>:18)
> at <init>(<console>:20)
> at .<init>(<console>:24)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
> at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
> at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
> at 
> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
> at 
> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> 

[jira] [Created] (SPARK-13708) Null pointer Exception while starting spark shell

2016-03-06 Thread Sowmya Dureddy (JIRA)
Sowmya Dureddy created SPARK-13708:
--

 Summary: Null pointer Exception while starting spark shell
 Key: SPARK-13708
 URL: https://issues.apache.org/jira/browse/SPARK-13708
 Project: Spark
  Issue Type: Question
  Components: Spark Shell
Affects Versions: 1.6.0
 Environment: windows platform
Reporter: Sowmya Dureddy


16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
16/03/06 11:49:24 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
16/03/06 11:49:24 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
16/03/06 11:49:25 WARN : Your hostname, LAPTOP-B9GV6F82 resolves to a 
loopback/non-reachable address: fe80:0:0:0:1cbd:55b:b3e4:8e04%12, but we 
couldn't find any external IP address!
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
at 
org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown 

[jira] [Comment Edited] (SPARK-10548) Concurrent execution in SQL does not work

2016-03-06 Thread nicerobot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182265#comment-15182265
 ] 

nicerobot edited comment on SPARK-10548 at 3/6/16 6:47 PM:
---

I might be misunderstanding the solution, but I'm not clear on how the 
implementation addresses the problem. The issue appears to be that the 
ThreadLocal property {{"spark.sql.execution.id"}} is not handled properly 
(cleaned up) in thread-pooled environments. The implemented solution is 
essentially in {{SparkContext}}:

{code}
  // Note: make a clone such that changes in the parent properties aren't 
reflected in
  // the those of the children threads, which has confusing semantics 
(SPARK-10563).
  SerializationUtils.clone(parent).asInstanceOf[Properties]
{code}

But from what I can tell, the problem isn't related to parent/child threads. 
It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained 
after a thread completes. When that thread is returned to the pool and reused 
by another execution, the execution id will remain because it's part of the 
SparkContext's {{localProperties}}. It seems like a 
{{"spark.sql.execution.id"}} should be local to an execution context instance 
(a {{QueryExecution}}?), not global to a thread nor specifically a property of 
a SQLContext/SparkContext.


was (Author: nicerobot):
I might be misunderstanding the solution but i'm not clear how the 
implementation addresses the problem. The issue appears to be that the 
ThreadLocal property "spark.sql.execution.id" is not handled properly in 
thread-pooled environments. The implemented solution is essentially in 
{{SparkContext}}

{code}
  // Note: make a clone such that changes in the parent properties aren't 
reflected in
  // the those of the children threads, which has confusing semantics 
(SPARK-10563).
  SerializationUtils.clone(parent).asInstanceOf[Properties]
{code}

But from what I can tell, the problem isn't related to parent/child threads. 
It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained 
after a thread completes. When that thread is returned to the pool and reused 
by another execution, the execution id will remain because it's part of the 
SparkContext's {{localProperties}}. It seems like a 
{{"spark.sql.execution.id"}} should be local to an execution context instance 
(a {{QueryExecution}}?), not global to a thread nor specifically a property of 
a SQLContext/SparkContext.

> Concurrent execution in SQL does not work
> -
>
> Key: SPARK-10548
> URL: https://issues.apache.org/jira/browse/SPARK-10548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> From the mailing list:
> {code}
> future { df1.count() } 
> future { df2.count() } 
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> === edit ===
> Simple reproduction:
> {code}
> (1 to 100).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10548) Concurrent execution in SQL does not work

2016-03-06 Thread nicerobot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182265#comment-15182265
 ] 

nicerobot commented on SPARK-10548:
---

I might be misunderstanding the solution but i'm not clear how the 
implementation addresses the problem. The issue appears to be that the 
ThreadLocal property "spark.sql.execution.id" is not handled properly in 
thread-pooled environments. The implemented solution is essentially in 
{{SparkContext}}

{code}
  // Note: make a clone such that changes in the parent properties aren't 
reflected in
  // the those of the children threads, which has confusing semantics 
(SPARK-10563).
  SerializationUtils.clone(parent).asInstanceOf[Properties]
{code}

But from what I can tell, the problem isn't related to parent/child threads. 
It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained 
after a thread completes. When that thread is returned to the pool and reused 
by another execution, the execution id will remain because it's part of the 
SparkContext's {{localProperties}}. It seems like a 
{{"spark.sql.execution.id"}} should be local to an execution context instance 
(a {{QueryExecution}}?), not global to a thread nor specifically a property of 
a SQLContext/SparkContext.

> Concurrent execution in SQL does not work
> -
>
> Key: SPARK-10548
> URL: https://issues.apache.org/jira/browse/SPARK-10548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> From the mailing list:
> {code}
> future { df1.count() } 
> future { df2.count() } 
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> === edit ===
> Simple reproduction:
> {code}
> (1 to 100).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2016-03-06 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182262#comment-15182262
 ] 

Neelesh Srinivas Salian commented on SPARK-8480:



Looking into this. Will post a PR for review soon.
Thank you.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand what a 
> cache is for (e.g. "data for LogisticRegression model"). It would be nice to 
> have the same method for DataFrame, since the cache page currently displays 
> its logical schema, which can be quite big.
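
For reference, a small sketch (hypothetical paths and names; assumes {{sc}} and 
{{sqlContext}} from a Spark 1.x shell) of the existing RDD naming and the 
temp-table workaround often used to get a readable label for a cached DataFrame 
in the Storage tab:

{code}
// RDDs can already be labelled; the name shows up in the web UI's Storage tab.
val training = sc.textFile("/data/lr").setName("data for LogisticRegression model").cache()

// Workaround sketch for DataFrames: caching through a named temp table makes the
// Storage tab show "In-memory table lr_training" instead of the full logical schema.
val df = sqlContext.read.parquet("/data/lr.parquet")
df.registerTempTable("lr_training")
sqlContext.cacheTable("lr_training")
{code}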



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182259#comment-15182259
 ] 

Jatin Kumar edited comment on SPARK-13707 at 3/6/16 6:29 PM:
-

The records shown in the image are actually the 2-second batches which share a 
boundary with the 120-second window batch.


was (Author: kjatin):
The records shown in image are of the 2 sec batch which share boundry with the 
120sec window batch.

> Streaming UI tab misleading for window operations
> -
>
> Key: SPARK-13707
> URL: https://issues.apache.org/jira/browse/SPARK-13707
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jatin Kumar
> Attachments: Screen Shot 2016-03-06 at 11.09.55 pm.png
>
>
> 'Streaming' tab on spark UI is misleading when the job has a window operation 
> which changes the batch duration from original streaming context batch 
> duration.
> For instance consider:
> {code:java}
> val streamingContext = new StreamingContext(sparkConfig, Seconds(2))
> val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
> "TotalVideoImpressions")
> val totalImps = streamingContext.sparkContext.accumulator(0, 
> "TotalImpressions")
> val stream = KafkaReader.KafkaDirectStream(streamingContext)
> stream.map(KafkaAdLogParser.parseAdLogRecord)
>   .filter(record => {
> totalImps += 1
> KafkaAdLogParser.isVideoRecord(record)
>   })
>   .map(record => {
> totalVideoImps += 1
> record.url
>   })
>   .window(Seconds(120))
>   .count().foreachRDD((rdd, time) => {
>   println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
>   println("Count: " + rdd.collect()(0))
>   println("Total Impressions: " + totalImps.value)
>   totalImps.setValue(0)
>   println("Total Video Impressions: " + totalVideoImps.value)
>   totalVideoImps.setValue(0)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> Batch size before the window operation is 2 seconds; after the window, batches 
> are 120 seconds each.
> --
> The above code printed the following for my application, whereas the UI showed 
> different numbers.
> {noformat}
> Timestamp: 2016-03-06 12:02:56,000
> Count: 362195
> Total Impressions: 16882431
> Total Video Impressions: 362195
> Timestamp: 2016-03-06 12:04:56,000
> Count: 367168
> Total Impressions: 19480293
> Total Video Impressions: 367168
> Timestamp: 2016-03-06 12:06:56,000
> Count: 177711
> Total Impressions: 10196677
> Total Video Impressions: 177711
> {noformat}
> whereas the Spark UI shows different numbers, as attached in the image. Also, 
> if we check the start and end index of the Kafka partition offsets reported by 
> subsequent batch entries on the UI, they do not form an overall continuous 
> range. All numbers are fine if we remove the window operation, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jatin Kumar updated SPARK-13707:

Attachment: Screen Shot 2016-03-06 at 11.09.55 pm.png

The records shown in the image are of the 2-second batch which shares a boundary 
with the 120-second window batch.

> Streaming UI tab misleading for window operations
> -
>
> Key: SPARK-13707
> URL: https://issues.apache.org/jira/browse/SPARK-13707
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jatin Kumar
> Attachments: Screen Shot 2016-03-06 at 11.09.55 pm.png
>
>
> 'Streaming' tab on spark UI is misleading when the job has a window operation 
> which changes the batch duration from original streaming context batch 
> duration.
> For instance consider:
> {code:java}
> val streamingContext = new StreamingContext(sparkConfig, Seconds(2))
> val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
> "TotalVideoImpressions")
> val totalImps = streamingContext.sparkContext.accumulator(0, 
> "TotalImpressions")
> val stream = KafkaReader.KafkaDirectStream(streamingContext)
> stream.map(KafkaAdLogParser.parseAdLogRecord)
>   .filter(record => {
> totalImps += 1
> KafkaAdLogParser.isVideoRecord(record)
>   })
>   .map(record => {
> totalVideoImps += 1
> record.url
>   })
>   .window(Seconds(120))
>   .count().foreachRDD((rdd, time) => {
>   println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
>   println("Count: " + rdd.collect()(0))
>   println("Total Impressions: " + totalImps.value)
>   totalImps.setValue(0)
>   println("Total Video Impressions: " + totalVideoImps.value)
>   totalVideoImps.setValue(0)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> Batch size before the window operation is 2 seconds; after the window, batches 
> are 120 seconds each.
> --
> The above code printed the following for my application, whereas the UI showed 
> different numbers.
> {noformat}
> Timestamp: 2016-03-06 12:02:56,000
> Count: 362195
> Total Impressions: 16882431
> Total Video Impressions: 362195
> Timestamp: 2016-03-06 12:04:56,000
> Count: 367168
> Total Impressions: 19480293
> Total Video Impressions: 367168
> Timestamp: 2016-03-06 12:06:56,000
> Count: 177711
> Total Impressions: 10196677
> Total Video Impressions: 177711
> {noformat}
> whereas the Spark UI shows different numbers, as attached in the image. Also, 
> if we check the start and end index of the Kafka partition offsets reported by 
> subsequent batch entries on the UI, they do not form an overall continuous 
> range. All numbers are fine if we remove the window operation, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182257#comment-15182257
 ] 

Jatin Kumar commented on SPARK-13707:
-

Ideally all 2-second batches should be linked to the final 120-second batch, and 
one should be able to browse them from the UI, but I am not aware of the design 
decisions taken here, as this can get quite complex in the case of multiple 
window operations.

I would like to work on a fix for this if we can decide on what the behavior 
should be :)

> Streaming UI tab misleading for window operations
> -
>
> Key: SPARK-13707
> URL: https://issues.apache.org/jira/browse/SPARK-13707
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jatin Kumar
>
> 'Streaming' tab on spark UI is misleading when the job has a window operation 
> which changes the batch duration from original streaming context batch 
> duration.
> For instance consider:
> {code:java}
> val streamingContext = new StreamingContext(sparkConfig, Seconds(2))
> val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
> "TotalVideoImpressions")
> val totalImps = streamingContext.sparkContext.accumulator(0, 
> "TotalImpressions")
> val stream = KafkaReader.KafkaDirectStream(streamingContext)
> stream.map(KafkaAdLogParser.parseAdLogRecord)
>   .filter(record => {
> totalImps += 1
> KafkaAdLogParser.isVideoRecord(record)
>   })
>   .map(record => {
> totalVideoImps += 1
> record.url
>   })
>   .window(Seconds(120))
>   .count().foreachRDD((rdd, time) => {
>   println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
>   println("Count: " + rdd.collect()(0))
>   println("Total Impressions: " + totalImps.value)
>   totalImps.setValue(0)
>   println("Total Video Impressions: " + totalVideoImps.value)
>   totalVideoImps.setValue(0)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> Batch size before the window operation is 2 seconds; after the window, batches 
> are 120 seconds each.
> --
> The above code printed the following for my application, whereas the UI showed 
> different numbers.
> {noformat}
> Timestamp: 2016-03-06 12:02:56,000
> Count: 362195
> Total Impressions: 16882431
> Total Video Impressions: 362195
> Timestamp: 2016-03-06 12:04:56,000
> Count: 367168
> Total Impressions: 19480293
> Total Video Impressions: 367168
> Timestamp: 2016-03-06 12:06:56,000
> Count: 177711
> Total Impressions: 10196677
> Total Video Impressions: 177711
> {noformat}
> whereas the Spark UI shows different numbers, as attached in the image. Also, 
> if we check the start and end index of the Kafka partition offsets reported by 
> subsequent batch entries on the UI, they do not form an overall continuous 
> range. All numbers are fine if we remove the window operation, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jatin Kumar updated SPARK-13707:

Description: 
'Streaming' tab on spark UI is misleading when the job has a window operation 
which changes the batch duration from original streaming context batch duration.

For instance consider:
{code:java}
val streamingContext = new StreamingContext(sparkConfig, Seconds(2))

val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
"TotalVideoImpressions")
val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions")

val stream = KafkaReader.KafkaDirectStream(streamingContext)
stream.map(KafkaAdLogParser.parseAdLogRecord)
  .filter(record => {
totalImps += 1
KafkaAdLogParser.isVideoRecord(record)
  })
  .map(record => {
totalVideoImps += 1
record.url
  })
  .window(Seconds(120))
  .count().foreachRDD((rdd, time) => {
  println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
  println("Count: " + rdd.collect()(0))
  println("Total Impressions: " + totalImps.value)
  totalImps.setValue(0)
  println("Total Video Impressions: " + totalVideoImps.value)
  totalVideoImps.setValue(0)
})
streamingContext.start()
streamingContext.awaitTermination()
{code}
Batch size before the window operation is 2 seconds; after the window, batches 
are 120 seconds each.
--

The above code printed the following for my application, whereas the UI showed 
different numbers.
{noformat}
Timestamp: 2016-03-06 12:02:56,000
Count: 362195
Total Impressions: 16882431
Total Video Impressions: 362195

Timestamp: 2016-03-06 12:04:56,000
Count: 367168
Total Impressions: 19480293
Total Video Impressions: 367168

Timestamp: 2016-03-06 12:06:56,000
Count: 177711
Total Impressions: 10196677
Total Video Impressions: 177711
{noformat}

whereas the Spark UI shows different numbers, as attached in the image. Also, if 
we check the start and end index of the Kafka partition offsets reported by 
subsequent batch entries on the UI, they do not form an overall continuous 
range. All numbers are fine if we remove the window operation, though.

  was:
'Streaming' tab on spark UI is misleading when the job has a window operation 
which changes the batch duration from original streaming context batch duration.

For instance consider:
{{code:java}}
val streamingContext = new StreamingContext(sparkConfig, Seconds(2))

val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
"TotalVideoImpressions")
val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions")

val stream = KafkaReader.KafkaDirectStream(streamingContext)
stream.map(KafkaAdLogParser.parseAdLogRecord)
  .filter(record => {
totalImps += 1
KafkaAdLogParser.isVideoRecord(record)
  })
  .map(record => {
totalVideoImps += 1
record.url
  })
  .window(Seconds(120))
  .count().foreachRDD((rdd, time) => {
  println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
  println("Count: " + rdd.collect()(0))
  println("Total Impressions: " + totalImps.value)
  totalImps.setValue(0)
  println("Total Video Impressions: " + totalVideoImps.value)
  totalVideoImps.setValue(0)
})
streamingContext.start()
streamingContext.awaitTermination()
{{code}}
Batch Size before window operation is 2 sec and then after window batches are 
of 120 seconds each.
--

Above code printed following for my application whereas the UI showed different 
numbers.
{{noformat}}
Timestamp: 2016-03-06 12:02:56,000
Count: 362195
Total Impressions: 16882431
Total Video Impressions: 362195

Timestamp: 2016-03-06 12:04:56,000
Count: 367168
Total Impressions: 19480293
Total Video Impressions: 367168

Timestamp: 2016-03-06 12:06:56,000
Count: 177711
Total Impressions: 10196677
Total Video Impressions: 177711
{{noformat}}

whereas the spark UI shows different numbers as attached in the image. Also if 
we check the start and end index of kafka partition offsets reported by 
subsequent batch entries on UI, they do not result in all overall continuous 
range. All numbers are fine if we remove the window operation though.


> Streaming UI tab misleading for window operations
> -
>
> Key: SPARK-13707
> URL: https://issues.apache.org/jira/browse/SPARK-13707
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jatin Kumar
>
> 'Streaming' tab on spark UI is misleading when the job has a window operation 
> which changes the batch duration from original streaming context batch 
> duration.
> For instance consider:
> {code:java}
> val streamingContext = new StreamingContext(sparkConfig, Seconds(2))
> val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
> "TotalVideoImpressions")
> val totalImps = streamingContext.sparkContext.accumulator(0, 
> "TotalImpressions")
> val stream 

[jira] [Created] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar (JIRA)
Jatin Kumar created SPARK-13707:
---

 Summary: Streaming UI tab misleading for window operations
 Key: SPARK-13707
 URL: https://issues.apache.org/jira/browse/SPARK-13707
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Jatin Kumar


'Streaming' tab on spark UI is misleading when the job has a window operation 
which changes the batch duration from original streaming context batch duration.

For instance consider:
{{code:java}}
val streamingContext = new StreamingContext(sparkConfig, Seconds(2))

val totalVideoImps = streamingContext.sparkContext.accumulator(0, 
"TotalVideoImpressions")
val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions")

val stream = KafkaReader.KafkaDirectStream(streamingContext)
stream.map(KafkaAdLogParser.parseAdLogRecord)
  .filter(record => {
totalImps += 1
KafkaAdLogParser.isVideoRecord(record)
  })
  .map(record => {
totalVideoImps += 1
record.url
  })
  .window(Seconds(120))
  .count().foreachRDD((rdd, time) => {
  println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds))
  println("Count: " + rdd.collect()(0))
  println("Total Impressions: " + totalImps.value)
  totalImps.setValue(0)
  println("Total Video Impressions: " + totalVideoImps.value)
  totalVideoImps.setValue(0)
})
streamingContext.start()
streamingContext.awaitTermination()
{{code}}
Batch Size before window operation is 2 sec and then after window batches are 
of 120 seconds each.
--

Above code printed following for my application whereas the UI showed different 
numbers.
{{noformat}}
Timestamp: 2016-03-06 12:02:56,000
Count: 362195
Total Impressions: 16882431
Total Video Impressions: 362195

Timestamp: 2016-03-06 12:04:56,000
Count: 367168
Total Impressions: 19480293
Total Video Impressions: 367168

Timestamp: 2016-03-06 12:06:56,000
Count: 177711
Total Impressions: 10196677
Total Video Impressions: 177711
{{noformat}}

whereas the spark UI shows different numbers as attached in the image. Also if 
we check the start and end index of kafka partition offsets reported by 
subsequent batch entries on UI, they do not result in all overall continuous 
range. All numbers are fine if we remove the window operation though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13697.

   Resolution: Fixed
Fix Version/s: 1.4.2
   1.6.1
   1.5.3
   2.0.0

Issue resolved by pull request 11535
[https://github.com/apache/spark/pull/11535]

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0, 1.5.3, 1.6.1, 1.4.2
>
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, 
> >>> func.rdd_wrap_func, func.deserializers))) 
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2016-03-06 Thread Harsh Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh Gupta updated SPARK-12313:

Comment: was deleted

(was: [~liancheng] Should this issue occur after your PR 
https://github.com/apache/spark/pull/7492 ?)

> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Assignee: Harsh Gupta
>Priority: Critical
>
> When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter 
> is used.
> For a table partitioned by p1 and p2, when hc.sql("select col from tabl1 where 
> p1='p1V' and p2= 'p2V' ") is triggered, the HiveShim identifies the predicates 
> and ConvertFilters returns p1='p1V' and col2= 'p2V'. The same is passed to the 
> getPartitionsByFilter method as the filter string.
> In these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, for the SQL, the number of returned 
> rows is always zero.
> However, a filter on a single column always works. Probably it doesn't come 
> through this route.
> I'm using Oracle for the metastore, v0.13.1.
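
For illustration only, a minimal reproduction sketch along the lines of the 
report (table, column and values are hypothetical; {{hc}} is a HiveContext as 
above):

{code}
// Turn on metastore-side pruning so the getPartitionsByFilter path is exercised.
hc.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// Filter on a single partition column: partitions are returned as expected.
hc.sql("select col from tabl1 where p1 = 'p1V'").count()

// Filter on both partition columns: with the reported bug the metastore call
// returns no partitions, so this count comes back as zero.
hc.sql("select col from tabl1 where p1 = 'p1V' and p2 = 'p2V'").count()
{code}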



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2016-03-06 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182221#comment-15182221
 ] 

Harsh Gupta commented on SPARK-12313:
-

Hi [~lian cheng]. Should this issue occur even after your PR 
https://github.com/apache/spark/pull/7492 ?

> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Assignee: Harsh Gupta
>Priority: Critical
>
> When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter 
> is used.
> For a table partitioned by p1 and p2, when hc.sql("select col from tabl1 where 
> p1='p1V' and p2= 'p2V' ") is triggered, the HiveShim identifies the predicates 
> and ConvertFilters returns p1='p1V' and col2= 'p2V'. The same is passed to the 
> getPartitionsByFilter method as the filter string.
> In these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, for the SQL, the number of returned 
> rows is always zero.
> However, a filter on a single column always works. Probably it doesn't come 
> through this route.
> I'm using Oracle for the metastore, v0.13.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2016-03-06 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182220#comment-15182220
 ] 

Harsh Gupta commented on SPARK-12313:
-

[~liancheng] Should this issue occur after your PR 
https://github.com/apache/spark/pull/7492 ?

> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Assignee: Harsh Gupta
>Priority: Critical
>
> When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter 
> is used.
> For a table partitioned by p1 and p2, when hc.sql("select col from tabl1 where 
> p1='p1V' and p2= 'p2V' ") is triggered, the HiveShim identifies the predicates 
> and ConvertFilters returns p1='p1V' and col2= 'p2V'. The same is passed to the 
> getPartitionsByFilter method as the filter string.
> In these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, for the SQL, the number of returned 
> rows is always zero.
> However, a filter on a single column always works. Probably it doesn't come 
> through this route.
> I'm using Oracle for the metastore, v0.13.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))

2016-03-06 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182214#comment-15182214
 ] 

Santiago M. Mola commented on SPARK-13701:
--

Installed gfortran. Now it fails in NNLSSuite, while ALSSuite succeeds.

{code}
[info] NNLSSuite:
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.mllib.optimization.NNLSSuite *** ABORTED *** (68 milliseconds)
[info]   java.lang.UnsatisfiedLinkError: 
org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
[info]   at org.jblas.NativeBlas.dgemm(Native Method)
[info]   at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
[info]   at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781)
[info]   at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138)
[info]   at 
org.apache.spark.mllib.optimization.NNLSSuite.genOnesData(NNLSSuite.scala:33)
[info]   at 
org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(NNLSSuite.scala:56)
[info]   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
[info]   at 
org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply$mcV$sp(NNLSSuite.scala:55)
[info]   at 
org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply(NNLSSuite.scala:45)
[info]   at 
org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply(NNLSSuite.scala:45)
{code}
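
As a quick local check (a sketch, not from the ticket), the failing native path 
can be exercised directly with jblas, since the aborted suites fail inside 
NativeBlas.dgemm:

{code}
import org.jblas.DoubleMatrix

// Multiplying two DoubleMatrix instances goes through SimpleBlas.gemm and hence
// NativeBlas.dgemm, which is the call that fails to link on arm64.
val a = new DoubleMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
println(a.mmul(a))
{code}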

> MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm))
> --
>
> Key: SPARK-13701
> URL: https://issues.apache.org/jira/browse/SPARK-13701
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
> Environment: Ubuntu 14.04 on aarch64
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: arm64, porting
>
> jblas fails on arm64.
> {code}
> ALSSuite:
> Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 
> milliseconds)
>   java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
>   at org.jblas.NativeBlas.dgemm(Native Method)
>   at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
>   at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781)
>   at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138)
>   at 
> org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13706) Python Example for Train Validation Split Missing

2016-03-06 Thread Jeremy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy updated SPARK-13706:
---
Description: An example of how to use TrainValidationSplit in pyspark needs 
to be added. Should be consistent with the current examples. I'll submit a PR.  
(was: And example of how to use TrainValidationSplit in pyspark needs to be 
added. Should be consistent with the current examples. I'll submit a PR.)

> Python Example for Train Validation Split Missing
> -
>
> Key: SPARK-13706
> URL: https://issues.apache.org/jira/browse/SPARK-13706
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Reporter: Jeremy
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An example of how to use TrainValidationSplit in pyspark needs to be added. 
> Should be consistent with the current examples. I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13706) Python Example for Train Validation Split Missing

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182212#comment-15182212
 ] 

Apache Spark commented on SPARK-13706:
--

User 'JeremyNixon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11547

> Python Example for Train Validation Split Missing
> -
>
> Key: SPARK-13706
> URL: https://issues.apache.org/jira/browse/SPARK-13706
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Reporter: Jeremy
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An example of how to use TrainValidationSplit in pyspark needs to be added. 
> Should be consistent with the current examples. I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13706) Python Example for Train Validation Split Missing

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13706:


Assignee: (was: Apache Spark)

> Python Example for Train Validation Split Missing
> -
>
> Key: SPARK-13706
> URL: https://issues.apache.org/jira/browse/SPARK-13706
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Reporter: Jeremy
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An example of how to use TrainValidationSplit in pyspark needs to be added. 
> Should be consistent with the current examples. I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13706) Python Example for Train Validation Split Missing

2016-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13706:


Assignee: Apache Spark

> Python Example for Train Validation Split Missing
> -
>
> Key: SPARK-13706
> URL: https://issues.apache.org/jira/browse/SPARK-13706
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Reporter: Jeremy
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An example of how to use TrainValidationSplit in pyspark needs to be added. 
> Should be consistent with the current examples. I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13706) Python Example for Train Validation Split Missing

2016-03-06 Thread Jeremy (JIRA)
Jeremy created SPARK-13706:
--

 Summary: Python Example for Train Validation Split Missing
 Key: SPARK-13706
 URL: https://issues.apache.org/jira/browse/SPARK-13706
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Reporter: Jeremy
Priority: Minor


An example of how to use TrainValidationSplit in pyspark needs to be added. 
Should be consistent with the current examples. I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


