[jira] [Created] (SPARK-13713) Replace ANTLR3 SQL parser with an ANTLR4 SQL parser
Herman van Hovell created SPARK-13713:
-------------------------------------

             Summary: Replace ANTLR3 SQL parser with an ANTLR4 SQL parser
                 Key: SPARK-13713
                 URL: https://issues.apache.org/jira/browse/SPARK-13713
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Herman van Hovell

Replace the current SQL parser with a version written in ANTLR4. This has several advantages:
* A much simpler grammar structure
* No generated-code blowup
* A reduction in lines of code

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182681#comment-15182681 ]

Shixiong Zhu commented on SPARK-13711:
--------------------------------------

Sorry, I misread the description. Did you run in client mode? I think SparkUncaughtExceptionHandler should not be set for the driver.

> Apache Spark driver stopping JVM when master not available
> ----------------------------------------------------------
>
>                 Key: SPARK-13711
>                 URL: https://issues.apache.org/jira/browse/SPARK-13711
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.4.1, 1.6.0
>            Reporter: Era
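The discussion above hinges on a threading detail worth making explicit: the connection failure is raised on a background thread (the AppClient endpoint), so the reporter's try/catch around `new JavaSparkContext(...)` in `main()` can never see it; the exception instead reaches the thread's uncaught-exception handler, and Spark's SparkUncaughtExceptionHandler responds by exiting the JVM with code 50. A minimal sketch of that mechanism, with invented names and a handler that records the event rather than calling `System.exit`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class UncaughtHandlerDemo {

    static String run() throws InterruptedException {
        AtomicBoolean handlerFired = new AtomicBoolean(false);
        AtomicBoolean caughtInMain = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(1);

        // Stand-in for SparkUncaughtExceptionHandler: the real handler
        // calls System.exit(50); here we only record that it fired.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            handlerFired.set(true);
            done.countDown();
        });

        try {
            // The failure happens on a background thread, like the
            // AppClient endpoint thread failing to reach the master.
            new Thread(() -> {
                throw new RuntimeException("Failed to connect to master");
            }).start();
        } catch (Throwable e) {
            // Never reached: the exception is not thrown on this thread.
            caughtInMain.set(true);
        }

        done.await();
        return "handlerFired=" + handlerFired.get()
                + " caughtInMain=" + caughtInMain.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
    }
}
```

This is why catching `Throwable` in `main()` cannot keep the reporter's program alive: the exit decision is made by the handler on the failing thread, not by the caller.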
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182676#comment-15182676 ]

Wenchen Fan commented on SPARK-12718:
-------------------------------------

Hi [~smilegator], it seems I underestimated the difficulty of this job. I have a simple PR that works for the common cases; do you mind taking a look to see what's missing? You could also send out your PR to explain your approach and how it handles the special cases.

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for
> more details.
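For context on what this sub-task involves: "SQL generation" here means turning a resolved plan back into SQL text, and the window-function piece amounts to rendering `fn(args) OVER (PARTITION BY ... ORDER BY ...)` clauses. A hypothetical sketch of just that rendering step (this is not Spark's actual SQLBuilder code; all names are invented):

```java
public class WindowSqlSketch {

    /**
     * Render a window expression as SQL text:
     * fn(args) OVER (PARTITION BY ... ORDER BY ...).
     */
    static String windowSql(String function, String[] partitionBy, String[] orderBy) {
        StringBuilder sb = new StringBuilder(function).append(" OVER (");
        if (partitionBy.length > 0) {
            sb.append("PARTITION BY ").append(String.join(", ", partitionBy));
        }
        if (orderBy.length > 0) {
            if (partitionBy.length > 0) sb.append(' ');
            sb.append("ORDER BY ").append(String.join(", ", orderBy));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(windowSql("row_number()",
                new String[]{"dept"}, new String[]{"salary DESC"}));
    }
}
```

The hard part discussed in this thread is not this string rendering but recovering correct scoping and qualifiers when the plan is rewritten into nested subqueries.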
[jira] [Commented] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182674#comment-15182674 ]

Shixiong Zhu commented on SPARK-13711:
--------------------------------------

This is the correct behavior: the driver needs to talk to the master to request and launch executors.

> Apache Spark driver stopping JVM when master not available
> ----------------------------------------------------------
>
>                 Key: SPARK-13711
>                 URL: https://issues.apache.org/jira/browse/SPARK-13711
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.4.1, 1.6.0
>            Reporter: Era
[jira] [Assigned] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12718:
------------------------------------

    Assignee: Xiao Li  (was: Apache Spark)

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Assigned] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12718:
------------------------------------

    Assignee: Apache Spark  (was: Xiao Li)

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Apache Spark
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182670#comment-15182670 ]

Apache Spark commented on SPARK-12718:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11555

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Assigned] (SPARK-13712) Add OneVsOne to ML
[ https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13712:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add OneVsOne to ML
> ------------------
>
>                 Key: SPARK-13712
>                 URL: https://issues.apache.org/jira/browse/SPARK-13712
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: zhengruifeng
>            Priority: Minor
[jira] [Commented] (SPARK-13712) Add OneVsOne to ML
[ https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182660#comment-15182660 ]

Apache Spark commented on SPARK-13712:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11554

> Add OneVsOne to ML
> ------------------
>
>                 Key: SPARK-13712
>                 URL: https://issues.apache.org/jira/browse/SPARK-13712
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: zhengruifeng
>            Priority: Minor
[jira] [Assigned] (SPARK-13712) Add OneVsOne to ML
[ https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13712:
------------------------------------

    Assignee: Apache Spark

> Add OneVsOne to ML
> ------------------
>
>                 Key: SPARK-13712
>                 URL: https://issues.apache.org/jira/browse/SPARK-13712
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: zhengruifeng
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Created] (SPARK-13712) Add OneVsOne to ML
zhengruifeng created SPARK-13712:
------------------------------------

             Summary: Add OneVsOne to ML
                 Key: SPARK-13712
                 URL: https://issues.apache.org/jira/browse/SPARK-13712
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: zhengruifeng
            Priority: Minor

Another meta-method for multi-class classification.

Most classification algorithms were designed for balanced data. The OneVsRest method generates K models on imbalanced data, whereas OneVsOne trains K*(K-1)/2 models on balanced data.

OneVsOne is less sensitive to the problems of imbalanced datasets and can usually achieve higher precision. It is much more computationally expensive, although each model is trained on a much smaller dataset (2/K of the total).

OneVsOne is implemented the same way OneVsRest is:

val classifier = new LogisticRegression()
val ovo = new OneVsOne()
ovo.setClassifier(classifier)
val ovoModel = ovo.fit(data)
val predictions = ovoModel.transform(data)
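The pairwise scheme described above can be sketched outside of Spark. This hypothetical Java sketch (not the proposed OneVsOne API; all names invented) shows the two mechanical pieces: enumerating the K*(K-1)/2 unordered class pairs, and predicting by majority vote over the pairwise models' winners:

```java
import java.util.ArrayList;
import java.util.List;

public class OneVsOneSketch {

    /** All unordered class pairs (i, j) with i < j: K*(K-1)/2 of them. */
    static List<int[]> classPairs(int numClasses) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < numClasses; i++) {
            for (int j = i + 1; j < numClasses; j++) {
                pairs.add(new int[]{i, j});
            }
        }
        return pairs;
    }

    /**
     * Majority vote: votes[p] is the class that pairwise model p picked
     * for this example; the class with the most wins is the prediction.
     */
    static int predict(int numClasses, int[] votes) {
        int[] tally = new int[numClasses];
        for (int winner : votes) {
            tally[winner]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (tally[c] > tally[best]) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        // For K = 4 classes there are 4*3/2 = 6 pairwise models.
        System.out.println(classPairs(4).size());
    }
}
```

Each pairwise model only ever sees the examples of its two classes, which is where the balanced-data and 2/K-of-the-total claims in the description come from.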
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182644#comment-15182644 ]

Xiao Li commented on SPARK-12718:
---------------------------------

So far, SQL generation for window functions works well. However, qualifier-related issues break a few test cases. Because RecoverScopingInfo adds extra subqueries, we need a new rule after the `Canonicalizer` batch to populate the correct qualifiers for the AttributeReferences. I am adding this rule now. Thanks!

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Created] (SPARK-13711) Apache Spark driver stopping JVM when master not available
Era created SPARK-13711:
---------------------------

             Summary: Apache Spark driver stopping JVM when master not available
                 Key: SPARK-13711
                 URL: https://issues.apache.org/jira/browse/SPARK-13711
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.0, 1.4.1
            Reporter: Era

In my application, a Java Spark context is created with an unavailable master URL (you may assume the master is down for maintenance). Creating the Java Spark context stops the JVM that runs the Spark driver with exit code 50.

When I checked the logs I found SparkUncaughtExceptionHandler calling System.exit. My program should run forever.

package test.mains;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckJavaSparkContext {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("test");
        conf.setMaster("spark://sunshinee:7077");

        try {
            new JavaSparkContext(conf);
        } catch (Throwable e) {
            System.out.println("Caught an exception : " + e.getMessage());
            //e.printStackTrace();
        }

        System.out.println("Waiting to complete...");
        while (true) {
        }
    }
}

Output log:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0
16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0)
16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233
16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233
16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ps40233); users with modify permissions: Set(ps40233)
16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on port 55309.
16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started
16/03/04 18:01:21 INFO Remoting: Starting remoting
16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128]
16/03/04 18:01:22 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 52128.
16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker
16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster
16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98
16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB
16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040
16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master spark://sunshinee:7077...
16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master sunshinee:7077
java.io.IOException: Failed to connect to sunshinee:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182584#comment-15182584 ]

Xiao Li commented on SPARK-12718:
---------------------------------

Sure, thanks!

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182583#comment-15182583 ]

Wenchen Fan commented on SPARK-12718:
-------------------------------------

Then finish it; we can consolidate them later.

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182578#comment-15182578 ]

Xiao Li commented on SPARK-12718:
---------------------------------

Hi [~cloud_fan], yeah, almost done. Just let me know if I should continue with it. Thanks!

> SQL generation support for window functions
> -------------------------------------------
>
>                 Key: SPARK-12718
>                 URL: https://issues.apache.org/jira/browse/SPARK-12718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Xiao Li
[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182565#comment-15182565 ]

Masayoshi TSUZUKI commented on SPARK-13710:
-------------------------------------------

It shows a similar ERROR message and stack trace when exiting the Spark shell.
{noformat}
scala> :quit
16/03/07 13:06:31 INFO SparkUI: Stopped Spark web UI at http://192.168.33.129:4040
16/03/07 13:06:31 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/07 13:06:31 INFO MemoryStore: MemoryStore cleared
16/03/07 13:06:31 INFO BlockManager: BlockManager stopped
16/03/07 13:06:31 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/07 13:06:31 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/07 13:06:31 INFO SparkContext: Successfully stopped SparkContext
[WARN] Task failed
java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32
        at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.setConsoleMode(WindowsSupport.java:60)
        at scala.tools.jline_embedded.WindowsTerminal.setConsoleMode(WindowsTerminal.java:208)
        at scala.tools.jline_embedded.WindowsTerminal.restore(WindowsTerminal.java:95)
        at scala.tools.jline_embedded.TerminalSupport$1.run(TerminalSupport.java:52)
        at scala.tools.jline_embedded.internal.ShutdownHooks.runTasks(ShutdownHooks.java:66)
        at scala.tools.jline_embedded.internal.ShutdownHooks.access$000(ShutdownHooks.java:22)
        at scala.tools.jline_embedded.internal.ShutdownHooks$1.run(ShutdownHooks.java:47)
16/03/07 13:06:31 INFO ShutdownHookManager: Shutdown hook called
16/03/07 13:06:31 INFO ShutdownHookManager: Deleting directory C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
16/03/07 13:06:31 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
java.io.IOException: Failed to delete: C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940\repl-f753505f-69cf-4593-bb21-f5aa2683bcca
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:935)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:64)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:61)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:61)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:217)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:189)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:189)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:189)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1788)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:189)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:189)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:189)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:189)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:179)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/03/07 13:06:31 INFO ShutdownHookManager: Deleting directory C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
16/03/07 13:06:31 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
java.io.IOException: Failed to delete: C:\Users\tsudukim\AppData\Local\Temp\spark-d9077a51-fc78-4852-ad45-2b7085d72940
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:935)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:64)
        at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:61)
        at
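The deletion failure in the log above has a simple shape: Spark registers a shutdown hook that recursively deletes its temp directory, and on Windows the delete throws `IOException: Failed to delete: ...` if any file in the tree is still held open (here, plausibly by the jline shutdown hook that runs alongside it). A hypothetical sketch of such a recursive delete using NIO (this is not Spark's actual `Utils.deleteRecursively`):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempDirCleanup {

    /** Recursively delete a directory tree, children before parents. */
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    // Throws if p is still open or locked, which is the
                    // common failure mode on Windows at JVM shutdown.
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException("Failed to delete: " + p, e);
                }
            });
        }
    }

    /** Build a small temp tree resembling Spark's spark-*/repl-* layout. */
    static Path createTempTree() throws IOException {
        Path dir = Files.createTempDirectory("spark-demo-");
        Files.createFile(dir.resolve("repl-state"));
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path dir = createTempTree();
        deleteRecursively(dir);
        System.out.println(Files.exists(dir));
    }
}
```

When nothing holds the files open the delete succeeds, which is why the same shutdown sequence is silent on Linux in this report.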
[jira] [Created] (SPARK-13710) Spark shell shows ERROR when launching on Windows
Masayoshi TSUZUKI created SPARK-13710: - Summary: Spark shell shows ERROR when launching on Windows Key: SPARK-13710 URL: https://issues.apache.org/jira/browse/SPARK-13710 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Reporter: Masayoshi TSUZUKI Priority: Minor On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message and stacktrace. {noformat} C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell [ERROR] Terminal initialization failed; falling back to unsupported java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) at scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) at scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) at scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) at scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) at scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229) at scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221) at scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209) at scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61) at scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) at 
scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) at scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) at scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) at scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) at scala.util.Try$.apply(Try.scala:192) at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) at scala.collection.immutable.Stream.collect(Stream.scala:435) at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911) at org.apache.spark.repl.Main$.doMain(Main.scala:64) at org.apache.spark.repl.Main$.main(Main.scala:47) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/03/07 13:05:32 WARN NativeCodeLoader:
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182512#comment-15182512 ] Apache Spark commented on SPARK-13600: -- User 'oliverpierson' has created a pull request for this issue: https://github.com/apache/spark/pull/11553 > Incorrect number of buckets in QuantileDiscretizer > -- > > Key: SPARK-13600 > URL: https://issues.apache.org/jira/browse/SPARK-13600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0, 2.0.0 >Reporter: Oliver Pierson >Assignee: Oliver Pierson > > Under certain circumstances, QuantileDiscretizer fails to calculate the > correct splits resulting in an incorrect number of buckets/bins. > E.g. > val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x") > val discretizer = new > QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5) > discretizer.fit(df).getSplits > gives: > Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity) > which corresponds to 6 buckets (not 5). > The problem appears to be in the QuantileDiscretizer.findSplitsCandidates > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
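The off-by-one reported above is consistent with how splits define buckets in Spark ML's Bucketizer: an array of n + 1 split points bounds n buckets, so the 7 splits returned here imply 6 buckets rather than the 5 requested. A minimal sketch of that invariant in plain Python (no Spark; the helper name is hypothetical, not Spark API):

```python
import math

# A splits array of length n + 1 defines n buckets: bucket i covers
# [splits[i], splits[i+1]). The splits reported in this issue therefore
# imply one bucket more than the 5 requested via setNumBuckets(5).
def num_buckets(splits):
    """Number of buckets implied by a sorted splits array (hypothetical helper)."""
    return len(splits) - 1

reported = [-math.inf, 2.0, 4.0, 6.0, 8.0, 10.0, math.inf]
print(num_buckets(reported))  # 6, not the requested 5
```

A correct fix would have findSplitsCandidates produce at most numBuckets + 1 split points, including the two infinite sentinels.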
[jira] [Assigned] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13600: Assignee: Oliver Pierson (was: Apache Spark) > Incorrect number of buckets in QuantileDiscretizer > -- > > Key: SPARK-13600 > URL: https://issues.apache.org/jira/browse/SPARK-13600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0, 2.0.0 >Reporter: Oliver Pierson >Assignee: Oliver Pierson > > Under certain circumstances, QuantileDiscretizer fails to calculate the > correct splits resulting in an incorrect number of buckets/bins. > E.g. > val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x") > val discretizer = new > QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5) > discretizer.fit(df).getSplits > gives: > Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity) > which corresponds to 6 buckets (not 5). > The problem appears to be in the QuantileDiscretizer.findSplitsCandidates > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13600: Assignee: Apache Spark (was: Oliver Pierson) > Incorrect number of buckets in QuantileDiscretizer > -- > > Key: SPARK-13600 > URL: https://issues.apache.org/jira/browse/SPARK-13600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0, 2.0.0 >Reporter: Oliver Pierson >Assignee: Apache Spark > > Under certain circumstances, QuantileDiscretizer fails to calculate the > correct splits resulting in an incorrect number of buckets/bins. > E.g. > val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x") > val discretizer = new > QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5) > discretizer.fit(df).getSplits > gives: > Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity) > which corresponds to 6 buckets (not 5). > The problem appears to be in the QuantileDiscretizer.findSplitsCandidates > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182496#comment-15182496 ] Wenchen Fan commented on SPARK-12718: - Hi, [~xiaol], are you still working on it? I was working on it before and forgot to update this JIRA... > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13709) Spark unable to decode Avro when partitioned
[ https://issues.apache.org/jira/browse/SPARK-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Miller updated SPARK-13709: - Description: There is a problem decoding Avro data with SparkSQL when partitioned. The schema and encoded data are valid -- I'm able to decode the data with the avro-tools CLI utility. I'm also able to decode the data with non-partitioned SparkSQL tables, Hive, other tools as well... except partitioned SparkSQL schemas. For a simple example, I took the example schema and data found in the Oracle documentation here: *Schema* {code:javascript} { "type": "record", "name": "MemberInfo", "namespace": "avro", "fields": [ {"name": "name", "type": { "type": "record", "name": "FullName", "fields": [ {"name": "first", "type": "string"}, {"name": "last", "type": "string"} ] }}, {"name": "age", "type": "int"}, {"name": "address", "type": { "type": "record", "name": "Address", "fields": [ {"name": "street", "type": "string"}, {"name": "city", "type": "string"}, {"name": "state", "type": "string"}, {"name": "zip", "type": "int"} ] }} ] } {code} *Data* {code:javascript} { "name": { "first": "Percival", "last": "Lowell" }, "age": 156, "address": { "street": "Mars Hill Rd", "city": "Flagstaff", "state": "AZ", "zip": 86001 } } {code} *Create* (no partitions - works) If I create with no partitions, I'm able to query the data just fine. {code:sql} CREATE EXTERNAL TABLE IF NOT EXISTS foo ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/path/to/data/dir' TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); {code} *Create* (partitions -- does NOT work) If I create with no partitions, and then manually add a partition, all of my queries return an error. 
(I need to manually add partitions because I cannot control the structure of the data directories, so dynamic partitioning is not an option.) {code:sql} CREATE EXTERNAL TABLE IF NOT EXISTS foo PARTITIONED BY (ds STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir'; {code} The error: {code} spark-sql> SELECT * FROM foo WHERE ds = '1'; Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.collect(RDD.scala:926) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) at
[jira] [Created] (SPARK-13709) Spark unable to decode Avro when partitioned
Chris Miller created SPARK-13709: Summary: Spark unable to decode Avro when partitioned Key: SPARK-13709 URL: https://issues.apache.org/jira/browse/SPARK-13709 Project: Spark Issue Type: Bug Affects Versions: 1.6.0 Reporter: Chris Miller There is a problem decoding Avro data with SparkSQL when partitioned. The schema and encoded data are valid -- I'm able to decode the data with the avro-tools CLI utility. I'm also able to decode the data with non-partitioned SparkSQL tables, Hive, other tools as well... except partitioned SparkSQL schemas. For a simple example, I took the example schema and data found in the Oracle documentation here: **Schema** {code:javascript} { "type": "record", "name": "MemberInfo", "namespace": "avro", "fields": [ {"name": "name", "type": { "type": "record", "name": "FullName", "fields": [ {"name": "first", "type": "string"}, {"name": "last", "type": "string"} ] }}, {"name": "age", "type": "int"}, {"name": "address", "type": { "type": "record", "name": "Address", "fields": [ {"name": "street", "type": "string"}, {"name": "city", "type": "string"}, {"name": "state", "type": "string"}, {"name": "zip", "type": "int"} ] }} ] } {code} **Data** {code:javascript} { "name": { "first": "Percival", "last": "Lowell" }, "age": 156, "address": { "street": "Mars Hill Rd", "city": "Flagstaff", "state": "AZ", "zip": 86001 } } {code} **Create** (no partitions - works) If I create with no partitions, I'm able to query the data just fine. {code:sql} CREATE EXTERNAL TABLE IF NOT EXISTS foo ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/path/to/data/dir' TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); {code} **Create** (partitions -- does NOT work) If I create with no partitions, and then manually add a partition, all of my queries return an error. 
(I need to manually add partitions because I cannot control the structure of the data directories, so dynamic partitioning is not an option.) {code:sql} CREATE EXTERNAL TABLE IF NOT EXISTS foo PARTITIONED BY (ds STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir'; {code} The error: {code} spark-sql> SELECT * FROM foo WHERE ds = '1'; Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.collect(RDD.scala:926) at
[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182482#comment-15182482 ] Dongjoon Hyun commented on SPARK-12243: --- Here is the real Jenkins test time. {code} Tests passed in 804 seconds {code} > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
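A quick check of the timing figures in this thread: the PR builder's 992 s end-to-end matches the "~16.5 minutes" in the description, and the 804 s run reported above would be roughly a 19% improvement on it.

```python
# Sanity-check the two wall-clock figures quoted in this thread.
pr_builder_s = 992   # end-to-end PySpark test time in the PR builder
reported_s = 804     # "Tests passed in 804 seconds" from the comment above

print(round(pr_builder_s / 60, 1))                          # 16.5 (minutes)
print(f"{(pr_builder_s - reported_s) / pr_builder_s:.0%}")  # 19% faster
```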
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182472#comment-15182472 ] Taro L. Saito edited comment on SPARK-5928 at 3/7/16 2:24 AM: -- FYI. I created LArray library that can handle data larger than 2GB, which is the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports showing when this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything I can work on. was (Author: taroleo): FYI. I created LArray library that can handle data larger than 2GB, which is the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports when this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything that I can work on. > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at >
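The "Adjusted frame length exceeds 2147483647" in the trace above is Netty's LengthFieldBasedFrameDecoder rejecting a frame longer than its configured maximum. That limit, 2147483647, is 2^31 - 1 (Java's Integer.MAX_VALUE): frame lengths and Java array indices are 32-bit ints, so a single shuffle block above ~2 GB cannot be fetched as one frame. A quick check of the numbers in the exception:

```python
# 2147483647 is Java's Integer.MAX_VALUE (2**31 - 1); the block in the
# exception is ~2.81 GiB, which no single Java byte array or Netty frame
# with an int length field can hold.
MAX_FRAME_BYTES = 2**31 - 1        # Java Integer.MAX_VALUE
reported_block = 3_021_252_889     # block size from the exception above

print(MAX_FRAME_BYTES)                        # 2147483647
print(reported_block > MAX_FRAME_BYTES)       # True: the frame is discarded
print(f"{reported_block / 2**30:.2f} GiB")    # 2.81 GiB
```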
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182476#comment-15182476 ] Jose Cambronero commented on SPARK-8884: [~yuhaoyan] please do! I unfortunately got really busy last semester and this one in grad school and was not able to continue following up on this. > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, Gumbel, logistic, and > Weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov-Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
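The partition-wise computation described in the last paragraph can be sketched without Spark. This is a hedged reading of the description, not the actual proposed MLlib code: each partition of a globally sorted sample returns a small summary whose index-weighted terms use only the local 1-based position k, and the driver shifts each summary by the number of elements in earlier partitions (the "additional information" carried alongside). The Exponential(1) CDF and all names here are illustrative.

```python
import math

def cdf(x):
    """Stand-in CDF: Exponential(1). Any fully specified CDF would do."""
    return 1.0 - math.exp(-x)

def summarize(part):
    """Summary of one locally sorted partition.

    Returns (m, sum ln F, sum k*ln F, sum ln(1-F), sum k*ln(1-F)),
    where k is the 1-based index *within* the partition.
    """
    m = len(part)
    s_f = sum(math.log(cdf(x)) for x in part)
    s_kf = sum(k * math.log(cdf(x)) for k, x in enumerate(part, 1))
    s_1f = sum(math.log(1 - cdf(x)) for x in part)
    s_k1f = sum(k * math.log(1 - cdf(x)) for k, x in enumerate(part, 1))
    return m, s_f, s_kf, s_1f, s_k1f

def statistic(partitions):
    """Combine per-partition summaries; global index i = offset + k."""
    summaries = [summarize(p) for p in partitions]
    n = sum(s[0] for s in summaries)
    offset = 0
    s_f = s_if = s_1f = s_i1f = 0.0
    for m, pf, pkf, p1f, pk1f in summaries:
        s_f += pf
        s_if += offset * pf + pkf    # sum of i * ln F over this partition
        s_1f += p1f
        s_i1f += offset * p1f + pk1f
        offset += m
    # A^2 = -n - (1/n) * sum_i [(2i-1) ln F(x_i) + (2n+1-2i) ln(1-F(x_i))]
    return -n - ((2 * s_if - s_f) + ((2 * n + 1) * s_1f - 2 * s_i1f)) / n

def direct(sorted_xs):
    """Single-machine reference for checking the partition-wise version."""
    n = len(sorted_xs)
    s = sum((2 * i - 1) * math.log(cdf(x)) + (2 * n + 1 - 2 * i) * math.log(1 - cdf(x))
            for i, x in enumerate(sorted_xs, 1))
    return -n - s / n
```

Splitting the same sorted data into different partitions leaves the statistic unchanged, which is the property the comment relies on for a distributed implementation.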
[jira] [Commented] (SPARK-13034) PySpark ml.classification support export/import
[ https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182474#comment-15182474 ] Apache Spark commented on SPARK-13034: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/11552 > PySpark ml.classification support export/import > --- > > Key: SPARK-13034 > URL: https://issues.apache.org/jira/browse/SPARK-13034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/classification.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13034) PySpark ml.classification support export/import
[ https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13034: Assignee: (was: Apache Spark) > PySpark ml.classification support export/import > --- > > Key: SPARK-13034 > URL: https://issues.apache.org/jira/browse/SPARK-13034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add export/import for all estimators and transformers (which have a Scala > implementation) under pyspark/ml/classification.py. Please refer to the > implementation in SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13034) PySpark ml.classification support export/import
[ https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13034: Assignee: Apache Spark > PySpark ml.classification support export/import > --- > > Key: SPARK-13034 > URL: https://issues.apache.org/jira/browse/SPARK-13034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Add export/import for all estimators and transformers (which have a Scala > implementation) under pyspark/ml/classification.py. Please refer to the > implementation in SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182472#comment-15182472 ] Taro L. Saito commented on SPARK-5928: -- FYI. I created the LArray library, which can handle data larger than 2GB, the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports where this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything that I can work on. > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at >
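The `TooLongFrameException` in the trace above is Netty refusing a length-prefixed frame whose declared size (3021252889 bytes) exceeds `Integer.MAX_VALUE`, the hard cap on a single Java byte array. A toy Python decoder, not Netty's actual implementation, illustrates where that check bites:

```python
import struct

MAX_FRAME = 2_147_483_647  # Integer.MAX_VALUE: a Java byte array cannot be larger

class TooLongFrameError(Exception):
    """Stand-in for Netty's TooLongFrameException in this sketch."""

def decode_frame(buf):
    """Read an 8-byte big-endian length prefix, then return the payload.

    Even though the length field itself is 8 bytes, any frame longer than
    Integer.MAX_VALUE must be rejected, because the receiver cannot hold
    it in a single Java byte array. This mirrors the failure in the trace.
    """
    (length,) = struct.unpack_from(">q", buf, 0)
    if length > MAX_FRAME:
        raise TooLongFrameError(
            "Adjusted frame length exceeds %d: %d - discarded" % (MAX_FRAME, length))
    return buf[8:8 + length]
```

Feeding this decoder a header that claims the 3021252889-byte frame from the log reproduces the same "discarded" error message.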
[jira] [Comment Edited] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182460#comment-15182460 ] Dongjoon Hyun edited comment on SPARK-12243 at 3/7/16 1:49 AM: --- According to the log, the total time of all tests is *3077s*. So, the minimum required time for 4 processes is *769s*. was (Author: dongjoon): According to the log, the total time of all tests is **3077s**. So, the minimum required time for 4 processes is **769s**. > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182460#comment-15182460 ] Dongjoon Hyun commented on SPARK-12243: --- According to the log, the total time of all tests is **3077s**. So, the minimum required time for 4 processes is **769s**. > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12243: Assignee: (was: Apache Spark) > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182455#comment-15182455 ] Apache Spark commented on SPARK-12243: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11551 > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12243: Assignee: Apache Spark > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen >Assignee: Apache Spark > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182445#comment-15182445 ] Xiao Li commented on SPARK-12718: - In a window spec, the possible inputs are: 1. PARTITION BY + ORDER BY, 2. ORDER BY, 3. DISTRIBUTE BY + SORT BY, 4. SORT BY, 5. CLUSTER BY > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12243) PySpark tests are slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182442#comment-15182442 ] Dongjoon Hyun commented on SPARK-12243: --- Hi, [~joshrosen]. According to the recent [Running PySpark tests log|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console], it seems that the long-running tests start at the end due to the FIFO queue. In that case, I think we can reduce the test time by starting the long-running tests first with a simple priority queue. {code} ... Finished test(python3.4): pyspark.streaming.tests (213s) Finished test(pypy): pyspark.sql.tests (92s) Finished test(pypy): pyspark.streaming.tests (280s) Tests passed in 962 seconds {code} The long tests are the following: * pyspark.tests * pyspark.mllib.tests * pyspark.streaming.tests I'll make a PR for this as a first attempt to resolve this JIRA issue. > PySpark tests are slow in Jenkins > - > > Key: SPARK-12243 > URL: https://issues.apache.org/jira/browse/SPARK-12243 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark, Tests >Reporter: Josh Rosen > > In the Jenkins pull request builder, it looks like PySpark tests take around > 992 seconds (~16.5 minutes) of end-to-end time to run, despite the fact that > we run four Python test suites in parallel. We should try to figure out why > this is slow and see if there's any easy way to speed things up. > Note that the PySpark streaming tests take about 5 minutes to run, so > best-case we're looking at a 10 minute speedup via further parallelization. > We should also try to see whether there are individual slow tests in those > Python suites which can be sped up or skipped. > We could also consider running only the Python 2.6 tests in non-Pyspark pull > request builds and reserve testing of all Python versions for builds which > touch PySpark-related code. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
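The priority-queue idea in the comment above, starting the longest suites first so a long suite never begins when everything else is nearly done, can be simulated with a greedy least-loaded-worker model. The durations below are invented for the sketch (only 213s and 280s appear in the quoted log):

```python
import heapq

def makespan(durations, workers=4):
    """Assign suites in the given order, each to the currently least-loaded
    worker, and return the simulated end-to-end wall-clock time."""
    loads = [0] * workers
    heapq.heapify(loads)
    for d in durations:
        # the least-loaded worker picks up the next queued suite
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

# Hypothetical suite durations in seconds; long suites queued last (FIFO)
fifo_order = [30, 40, 25, 50, 35, 45, 213, 280, 300]
# Longest-first order, i.e. the proposed priority queue
lpt_order = sorted(fifo_order, reverse=True)
```

With these numbers the FIFO order finishes at 360s while the longest-first order finishes at 300s, much closer to the sum/4 lower bound (analogous to the 3077s / 4 = 769s bound computed in the comment above for the real suites).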
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182441#comment-15182441 ] Xiao Li commented on SPARK-12718: - If users use CLUSTER BY clauses, or DISTRIBUTE BY + SORT BY clauses, the generated SQL will convert them to PARTITION BY + ORDER BY. > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
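That normalization can be sketched as a small rewrite rule. The dict keys below are illustrative, not Catalyst's actual window-spec representation; the point is that CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x, so both forms collapse to the same PARTITION BY + ORDER BY shape:

```python
def normalize_window_spec(spec):
    """Rewrite a window spec into PARTITION BY + ORDER BY form.

    Hypothetical keys: 'cluster_by', 'distribute_by', 'sort_by',
    'partition_by', 'order_by', each mapping to a list of column names.
    """
    if "cluster_by" in spec:
        # CLUSTER BY c == DISTRIBUTE BY c SORT BY c
        cols = spec["cluster_by"]
        return {"partition_by": cols, "order_by": cols}
    if "distribute_by" in spec or "sort_by" in spec:
        return {"partition_by": spec.get("distribute_by", []),
                "order_by": spec.get("sort_by", [])}
    # already in (or defaulting to) PARTITION BY + ORDER BY form
    return {"partition_by": spec.get("partition_by", []),
            "order_by": spec.get("order_by", [])}
```

This covers all five input shapes listed in the earlier comment, since a bare ORDER BY or SORT BY simply yields an empty partitioning.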
[jira] [Commented] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182439#comment-15182439 ] Xiao Li commented on SPARK-12718: - I will not add an extra subquery here; I am trying to rebuild the original window SQL. > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6162) Handle missing values in GBM
[ https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182434#comment-15182434 ] Joseph K. Bradley commented on SPARK-6162: -- I agree this will be nice to add someday, but it's less pressing than other tasks for now. > Handle missing values in GBM > > > Key: SPARK-6162 > URL: https://issues.apache.org/jira/browse/SPARK-6162 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.1 >Reporter: Devesh Parekh > > We build a lot of predictive models over data combined from multiple sources, > where some entries may not have all sources of data and so some values are > missing in each feature vector. Another place this might come up is if you > have features from slightly heterogeneous items (or items composed of > heterogeneous subcomponents) that share many features in common but may have > extra features for different types, and you don't want to manually train > models for every different type. > R's GBM library, which is what we are currently using, deals with this type > of data nicely by making "missing" nodes in the decision tree (a surrogate > split) for features that can have missing values. We'd like to do the same > with MLLib, but LabeledPoint would need to support missing values, and > GradientBoostedTrees would need to be modified to deal with them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
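The R gbm behavior described in the issue can be sketched as a split rule with a dedicated "missing" direction. In a real implementation the direction is learned during training (the surrogate split); here it is just a parameter of this illustrative helper, which is not MLlib's API:

```python
import math

def route(x, threshold, missing_goes_left=True):
    """Route a feature value at a decision-tree split.

    Missing values (None or NaN) take the designated 'missing' branch
    instead of being compared against the threshold, in the spirit of
    R gbm's missing nodes.
    """
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return "left" if missing_goes_left else "right"
    return "left" if x <= threshold else "right"
```

Supporting this in MLlib would, as the description notes, also require a LabeledPoint representation that can carry missing values at all.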
[jira] [Closed] (SPARK-12731) PySpark docstring cleanup
[ https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12731. - Resolution: Won't Fix > PySpark docstring cleanup > - > > Key: SPARK-12731 > URL: https://issues.apache.org/jira/browse/SPARK-12731 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > We don't currently have any automated checks that our PySpark docstring lines > are within pep8/275/276 length limits (since the pep8 checker doesn't handle > this). As such there are ~400 non-conformant docstring lines. This JIRA is to > fix those docstring lines and add a command to lint python to fail on long > lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12731) PySpark docstring cleanup
[ https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182432#comment-15182432 ] Joseph K. Bradley commented on SPARK-12731: --- Alright, it sounds like the consensus is that we don't think this is worth fixing, so I'll close this. > PySpark docstring cleanup > - > > Key: SPARK-12731 > URL: https://issues.apache.org/jira/browse/SPARK-12731 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > We don't currently have any automated checks that our PySpark docstring lines > are within pep8/275/276 length limits (since the pep8 checker doesn't handle > this). As such there are ~400 non-conformant docstring lines. This JIRA is to > fix those docstring lines and add a command to lint python to fail on long > lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13667) Support for specifying custom date format for date and timestamp types
[ https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13667: Assignee: Apache Spark > Support for specifying custom date format for date and timestamp types > -- > > Key: SPARK-13667 > URL: https://issues.apache.org/jira/browse/SPARK-13667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Currently, the CSV data source does not support parsing date and timestamp types > in a custom format, or inferring timestamp types from a custom format. > It looks like quite a few users want this feature. It would be great to be able to set a > custom date format. > This was reported in spark-csv. > https://github.com/databricks/spark-csv/issues/279 > https://github.com/databricks/spark-csv/issues/262 > https://github.com/databricks/spark-csv/issues/266 > I have submitted a PR for this in spark-csv: > https://github.com/databricks/spark-csv/pull/280 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13667) Support for specifying custom date format for date and timestamp types
[ https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13667: Assignee: (was: Apache Spark) > Support for specifying custom date format for date and timestamp types > -- > > Key: SPARK-13667 > URL: https://issues.apache.org/jira/browse/SPARK-13667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, the CSV data source does not support parsing date and timestamp types > in a custom format, or inferring timestamp types from a custom format. > It looks like quite a few users want this feature. It would be great to be able to set a > custom date format. > This was reported in spark-csv. > https://github.com/databricks/spark-csv/issues/279 > https://github.com/databricks/spark-csv/issues/262 > https://github.com/databricks/spark-csv/issues/266 > I have submitted a PR for this in spark-csv: > https://github.com/databricks/spark-csv/pull/280 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13667) Support for specifying custom date format for date and timestamp types
[ https://issues.apache.org/jira/browse/SPARK-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182431#comment-15182431 ] Apache Spark commented on SPARK-13667: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/11550 > Support for specifying custom date format for date and timestamp types > -- > > Key: SPARK-13667 > URL: https://issues.apache.org/jira/browse/SPARK-13667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, the CSV data source does not support parsing date and timestamp types > in a custom format, or inferring timestamp types from a custom format. > It looks like quite a few users want this feature. It would be great to be able to set a > custom date format. > This was reported in spark-csv. > https://github.com/databricks/spark-csv/issues/279 > https://github.com/databricks/spark-csv/issues/262 > https://github.com/databricks/spark-csv/issues/266 > I have submitted a PR for this in spark-csv: > https://github.com/databricks/spark-csv/pull/280 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
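A minimal sketch of the requested behavior: during schema inference, a column is only typed as a timestamp if every value parses under the user-supplied format. Python strptime patterns stand in here for Java's SimpleDateFormat patterns, which is what spark-csv would actually use, and the function name is illustrative, not the proposed option:

```python
from datetime import datetime

def infer_with_date_format(values, date_format):
    """Return 'timestamp' if every value parses under date_format,
    otherwise fall back to 'string' (mirroring type-inference fallback)."""
    try:
        for v in values:
            datetime.strptime(v, date_format)
        return "timestamp"
    except ValueError:
        return "string"
```

With a user-supplied format the column type follows the data: values matching the pattern infer as timestamps, anything else falls back to strings rather than failing the load.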
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182420#comment-15182420 ] yuhao yang commented on SPARK-8884: --- Hi [~josepablocam]. Do you mind if I continue to work on this? I think this is well-written yet I might need to start another PR to finish it. Let me know if you still plan to work on it. Thanks. > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, gumbel, logistic, and > weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12566) GLM model family, link function support in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12566: Assignee: Apache Spark (was: yuhao yang) > GLM model family, link function support in SparkR:::glm > --- > > Key: SPARK-12566 > URL: https://issues.apache.org/jira/browse/SPARK-12566 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Critical > > This JIRA is for extending the support of MLlib's Generalized Linear Models > (GLMs) to more model families and link functions in SparkR. After > SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR > with support of popular families and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12566) GLM model family, link function support in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12566: Assignee: yuhao yang (was: Apache Spark) > GLM model family, link function support in SparkR:::glm > --- > > Key: SPARK-12566 > URL: https://issues.apache.org/jira/browse/SPARK-12566 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Critical > > This JIRA is for extending the support of MLlib's Generalized Linear Models > (GLMs) to more model families and link functions in SparkR. After > SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR > with support of popular families and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182401#comment-15182401 ] Apache Spark commented on SPARK-12566: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/11549 > GLM model family, link function support in SparkR:::glm > --- > > Key: SPARK-12566 > URL: https://issues.apache.org/jira/browse/SPARK-12566 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Critical > > This JIRA is for extending the support of MLlib's Generalized Linear Models > (GLMs) to more model families and link functions in SparkR. After > SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR > with support of popular families and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182384#comment-15182384 ] yuhao yang commented on SPARK-12566: Since we already have a glm in SparkR which is based on LogisticRegressionModel and LinearRegressionModel, there are three ways to extend it as I understand: 1. Change the current glm to use GeneralizedLinearRegression. Create another lm interface for SparkR, and use LR as the model. 2. Keep the glm R interface and replace its implementation with GLM. This means R cannot invoke LR anymore. 3. Keep the glm R interface, and combine the implementation with both LR and GLM based on different solver parameters. I'd prefer option 1. I'm going to send a PR (WIP) for solution 2, which can later be adjusted to 1 or 3. > GLM model family, link function support in SparkR:::glm > --- > > Key: SPARK-12566 > URL: https://issues.apache.org/jira/browse/SPARK-12566 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Critical > > This JIRA is for extending the support of MLlib's Generalized Linear Models > (GLMs) to more model families and link functions in SparkR. After > SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR > with support of popular families and link functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13496) Optimizing count distinct changes the resulting column name
[ https://issues.apache.org/jira/browse/SPARK-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182332#comment-15182332 ] Ryan Blue commented on SPARK-13496: --- I wouldn't say this is a duplicate, though I'm fine with provisionally closing this. SPARK-12593 is for generating SQL from logical plans and this is a bug report. > Optimizing count distinct changes the resulting column name > --- > > Key: SPARK-13496 > URL: https://issues.apache.org/jira/browse/SPARK-13496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > SPARK-9241 updated the optimizer to rewrite count distinct. That change uses > a count that is no longer distinct because duplicates are eliminated further > down in the plan. This caused the name of the column to change: > {code:title=Spark 1.5.2} > scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a")) > res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint] > == Physical Plan == > TungstenAggregate(key=[], > functions=[(count(a#7),mode=Complete,isDistinct=true)], > output=[COUNT(DISTINCT a)#9L]) > TungstenAggregate(key=[a#7], functions=[], output=[a#7]) > TungstenExchange SinglePartition >TungstenAggregate(key=[a#7], functions=[], output=[a#7]) > LocalTableScan [a#7], [[1]] > {code} > {code:title=Spark 1.6.0} > scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a")) > res0: org.apache.spark.sql.DataFrame = [count(a): bigint] > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else > null),mode=Final,isDistinct=false)], output=[count(a)#31L]) > +- TungstenExchange SinglePartition, None >+- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else > null),mode=Partial,isDistinct=false)], output=[count#39L]) > +- TungstenAggregate(key=[a#36,gid#35], functions=[], > output=[a#36,gid#35]) > +- TungstenExchange hashpartitioning(a#36,gid#35,500), None > +- TungstenAggregate(key=[a#36,gid#35], 
functions=[], > output=[a#36,gid#35]) >+- Expand [List(a#29, 1)], [a#36,gid#35] > +- LocalTableScan [a#29], [[1]] > {code} > This has broken jobs that used the generated name. For example, > {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}. > I think that the previous generated name is correct, even though the plan has > changed. > [~marmbrus], you may want to take a look. It looks like you reviewed > SPARK-9241 and have some context here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup
[ https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-13620: --- > Avoid reverse DNS lookup for 0.0.0.0 on startup > --- > > Key: SPARK-13620 > URL: https://issues.apache.org/jira/browse/SPARK-13620 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Daniel Darabos >Priority: Minor > > I noticed we spend 5+ seconds during application startup with the following > stack trace: > {code} > at java.net.Inet6AddressImpl.getHostByAddr(Native Method) > at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926) > at java.net.InetAddress.getHostFromNameService(InetAddress.java:611) > at java.net.InetAddress.getHostName(InetAddress.java:553) > at java.net.InetAddress.getHostName(InetAddress.java:525) > at > java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) > at > java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) > at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) > at org.spark-project.jetty.server.Server.(Server.java:115) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243) > at > org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262) > at > org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:136) > at > org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) > at > org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) > at scala.Option.foreach(Option.scala:236) > at 
org.apache.spark.SparkContext.(SparkContext.scala:481) > {code} > Spark wants to start a server on localhost. So it [creates an > {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243] > [with hostname > {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136]. > Spark passes in a hostname string, but Java [recognizes that it's actually > an > address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220] > and so sets the hostname to {{null}}. So when Jetty [calls > {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115] > Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds > on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04. > There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set > a hostname. In this case [{{InetAddress.anyLocalAddress()}} is > used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166], > which is the same, but does not need resolving. > {code} > scala> { val t0 = System.currentTimeMillis; new > java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; > System.currentTimeMillis - t0 } > res0: Long = 5432 > scala> { val t0 = System.currentTimeMillis; new > java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 } > res1: Long = 0 > {code} > I'll send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
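The difference described above can be sketched with the plain JDK (this is not Spark code): the port-only InetSocketAddress constructor uses the wildcard address directly, so no unresolved hostname string is stored that would later force a reverse DNS lookup.

```java
import java.net.InetSocketAddress;

public class WildcardBindDemo {
    public static void main(String[] args) {
        // Passing the string "0.0.0.0" stores it as an *address*, not a
        // hostname, so a later getHostName() may need a reverse DNS lookup.
        InetSocketAddress withString = new InetSocketAddress("0.0.0.0", 8000);

        // The port-only constructor uses InetAddress.anyLocalAddress(),
        // which is the same wildcard address but needs no resolving.
        InetSocketAddress portOnly = new InetSocketAddress(8000);

        if (!portOnly.getAddress().isAnyLocalAddress())
            throw new AssertionError("expected wildcard address");
        if (!withString.getAddress().isAnyLocalAddress())
            throw new AssertionError("expected wildcard address");
        System.out.println("both bind to the wildcard address: "
            + portOnly.getAddress().getHostAddress());
    }
}
```

Both constructors bind to the same wildcard address; only the representation of its name differs, which is what makes getHostName() cheap in one case and slow in the other.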
[jira] [Updated] (SPARK-13609) Support Column Pruning for MapPartitions
[ https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13609: -- Assignee: Xiao Li > Support Column Pruning for MapPartitions > > > Key: SPARK-13609 > URL: https://issues.apache.org/jira/browse/SPARK-13609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > {code} > case class OtherTuple(_1: String, _2: Int) > val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS() > ds.as[OtherTuple].map(identity[OtherTuple]).explain(true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13647) also check if numeric value is within allowed range in _verify_type
[ https://issues.apache.org/jira/browse/SPARK-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13647: -- Assignee: Wenchen Fan > also check if numeric value is within allowed range in _verify_type > --- > > Key: SPARK-13647 > URL: https://issues.apache.org/jira/browse/SPARK-13647 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup
[ https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13620. --- Resolution: Not A Problem Fix Version/s: (was: 2.0.0) > Avoid reverse DNS lookup for 0.0.0.0 on startup > --- > > Key: SPARK-13620 > URL: https://issues.apache.org/jira/browse/SPARK-13620 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Daniel Darabos >Priority: Minor > > I noticed we spend 5+ seconds during application startup with the following > stack trace: > {code} > at java.net.Inet6AddressImpl.getHostByAddr(Native Method) > at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926) > at java.net.InetAddress.getHostFromNameService(InetAddress.java:611) > at java.net.InetAddress.getHostName(InetAddress.java:553) > at java.net.InetAddress.getHostName(InetAddress.java:525) > at > java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) > at > java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) > at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) > at org.spark-project.jetty.server.Server.(Server.java:115) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243) > at > org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262) > at > org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:136) > at > org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) > at > 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481) > at scala.Option.foreach(Option.scala:236) > at org.apache.spark.SparkContext.(SparkContext.scala:481) > {code} > Spark wants to start a server on localhost. So it [creates an > {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243] > [with hostname > {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136]. > Spark passes in a hostname string, but Java [recognizes that it's actually > an > address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220] > and so sets the hostname to {{null}}. So when Jetty [calls > {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115] > Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds > on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04. > There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set > a hostname. In this case [{{InetAddress.anyLocalAddress()}} is > used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166], > which is the same, but does not need resolving. > {code} > scala> { val t0 = System.currentTimeMillis; new > java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; > System.currentTimeMillis - t0 } > res0: Long = 5432 > scala> { val t0 = System.currentTimeMillis; new > java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 } > res1: Long = 0 > {code} > I'll send a pull request for this. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13598) Remove LeftSemiJoinBNL
[ https://issues.apache.org/jira/browse/SPARK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13598: -- Assignee: Davies Liu > Remove LeftSemiJoinBNL > -- > > Key: SPARK-13598 > URL: https://issues.apache.org/jira/browse/SPARK-13598 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Broadcast left semi join without joining keys is already supported in > BroadcastNestedLoopJoin; it has the same implementation as LeftSemiJoinBNL, > so we should remove the latter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13574) Improve parquet dictionary decoding for strings
[ https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13574: -- Assignee: Nong Li > Improve parquet dictionary decoding for strings > --- > > Key: SPARK-13574 > URL: https://issues.apache.org/jira/browse/SPARK-13574 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li >Priority: Minor > Fix For: 2.0.0 > > > Currently, the parquet reader will copy the dictionary value for each data > value. This is bad for string columns as we explode the dictionary during > decode. We should instead have the data values point to the same backing > memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
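The copy-versus-reference decode above can be illustrated outside Parquet entirely; a minimal Java sketch (the String[] dictionary and int[] codes are stand-ins for this example, not Spark's actual reader structures):

```java
public class DictDecodeDemo {
    public static void main(String[] args) {
        String[] dictionary = {"foo", "bar"};
        int[] codes = {0, 1, 0, 0, 1};

        // Copying decode: one new String per value, "exploding" the
        // dictionary into fresh objects during decode.
        String[] copied = new String[codes.length];
        for (int i = 0; i < codes.length; i++) {
            copied[i] = new String(dictionary[codes[i]]);
        }

        // Reference decode: each value points at the dictionary's
        // backing entry, so no per-value allocation happens.
        String[] shared = new String[codes.length];
        for (int i = 0; i < codes.length; i++) {
            shared[i] = dictionary[codes[i]];
        }

        if (copied[0] == copied[2])
            throw new AssertionError("copies should be distinct objects");
        if (shared[0] != shared[2])
            throw new AssertionError("references should share one object");
        System.out.println("reference decode allocates no per-value strings");
    }
}
```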
[jira] [Updated] (SPARK-13630) Add optimizer rule to collapse sorts
[ https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13630: -- Priority: Minor (was: Major) Fix Version/s: (was: 2.0.0) > Add optimizer rule to collapse sorts > > > Key: SPARK-13630 > URL: https://issues.apache.org/jira/browse/SPARK-13630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > It is possible to collapse adjacent sorts and keep the last one. This > task is to add an optimizer rule to collapse adjacent sorts if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
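The proposed rule is easy to demonstrate outside Spark; a small Java sketch with plain collections (not the Catalyst optimizer) showing that, for a total ordering, the last of two adjacent sorts alone determines the output:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CollapseSortsDemo {
    public static void main(String[] args) {
        List<Integer> twice = new ArrayList<>(Arrays.asList(3, 1, 2));
        twice.sort(Comparator.naturalOrder());  // first sort ...
        twice.sort(Comparator.reverseOrder());  // ... made redundant by this one

        // The "collapsed plan": keep only the last sort.
        List<Integer> once = new ArrayList<>(Arrays.asList(3, 1, 2));
        once.sort(Comparator.reverseOrder());

        if (!twice.equals(once))
            throw new AssertionError("adjacent sorts should collapse to the last one");
        System.out.println(once);  // [3, 2, 1]
    }
}
```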
[jira] [Updated] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.
[ https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13255: -- Assignee: Nong Li > Integrate vectorized parquet scan with whole stage codegen. > --- > > Key: SPARK-13255 > URL: https://issues.apache.org/jira/browse/SPARK-13255 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > The generated whole stage codegen is intended to be run over batches of rows. > This task is to integrate ColumnarBatches with whole stage codegen. > The resulting generated code should look something like: > {code} > Iterator<ColumnarBatch> input; > void process() { > while (input.hasNext()) { > ColumnarBatch batch = input.next(); > for (Row row : batch) { > // Current function > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
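The batch-then-row loop in that pseudocode can be sketched in runnable plain Java (an int[] stands in for ColumnarBatch here; none of Spark's actual classes are used):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchLoopDemo {
    // Sum rows one batch at a time: the outer loop pulls a batch, and the
    // inner loop is where the generated per-row code would be inlined.
    static long process(Iterator<int[]> input) {
        long sum = 0;
        while (input.hasNext()) {
            int[] batch = input.next();
            for (int row : batch) {
                sum += row;  // the "current function" from the pseudocode
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<int[]> batches = Arrays.asList(new int[]{1, 2}, new int[]{3, 4});
        long total = process(batches.iterator());
        if (total != 10)
            throw new AssertionError("expected 10, got " + total);
        System.out.println(total);  // 10
    }
}
```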
[jira] [Updated] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13685: -- Fix Version/s: (was: 2.0.0) > Rename catalog.Catalog to ExternalCatalog > - > > Key: SPARK-13685 > URL: https://issues.apache.org/jira/browse/SPARK-13685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13677: -- Component/s: MLlib > Support Tree-Based Feature Transformation for mllib > --- > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two well-known libraries: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implemented it in MLlib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
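The leaf-index one-hot encoding described above can be sketched in plain Java; this is a simplification that assumes a fixed number of leaves per tree, and is not the MLlib implementation:

```java
import java.util.Arrays;

public class LeafEncodeDemo {
    // One-hot encode the leaf index each tree assigned to a sample.
    // leafIdx[t] is the leaf (0..leavesPerTree-1) reached in tree t;
    // each tree contributes leavesPerTree slots to the new feature space.
    static double[] encode(int[] leafIdx, int leavesPerTree) {
        double[] features = new double[leafIdx.length * leavesPerTree];
        for (int t = 0; t < leafIdx.length; t++) {
            features[t * leavesPerTree + leafIdx[t]] = 1.0;
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical ensemble of 2 trees with 3 leaves each; a sample
        // lands in leaf 2 of tree 0 and leaf 0 of tree 1.
        double[] f = encode(new int[]{2, 0}, 3);
        if (!Arrays.equals(f, new double[]{0, 0, 1, 1, 0, 0}))
            throw new AssertionError(Arrays.toString(f));
        System.out.println(Arrays.toString(f));
    }
}
```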
[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13668: -- Component/s: SQL > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sameer Agarwal > > If a filter predicate or a join condition consists of `IsNotNull` checks, we > should reorder these checks such that these non-nullability checks are > evaluated before the rest of the predicates. > For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite this as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13668: -- Priority: Minor (was: Major) > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sameer Agarwal >Priority: Minor > > If a filter predicate or a join condition consists of `IsNotNull` checks, we > should reorder these checks such that these non-nullability checks are > evaluated before the rest of the predicates. > For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite this as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
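The benefit of evaluating the null check first is ordinary short-circuit evaluation; a small Java sketch of the principle (illustrated here with Java's unboxing NullPointerException, which is an analogy, not Spark's SQL null semantics):

```java
public class PredicateOrderDemo {
    public static void main(String[] args) {
        Integer b = null;

        // isNotNull(b) && b > 5: the null check short-circuits, so the
        // rest of the predicate is never evaluated for a null value.
        boolean safe = (b != null) && (b > 5);
        if (safe) throw new AssertionError("null value must not match");

        // b > 5 && isNotNull(b): evaluating "b > 5" first unboxes null
        // and throws, showing why the cheap null check should come first.
        boolean threw = false;
        try {
            boolean unsafe = (b > 5) && (b != null);
        } catch (NullPointerException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected NullPointerException");
        System.out.println("null check first avoids evaluating the rest");
    }
}
```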
[jira] [Updated] (SPARK-13702) Use diamond operator for generic instance creation in Java code
[ https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13702: -- Component/s: Examples > Use diamond operator for generic instance creation in Java code > --- > > Key: SPARK-13702 > URL: https://issues.apache.org/jira/browse/SPARK-13702 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Dongjoon Hyun >Priority: Trivial > > Java 7 or higher supports the `diamond` operator, which replaces the type > arguments required to invoke the constructor of a generic class with an empty > set of type parameters (<>). Currently, Spark Java code uses this > inconsistently. This issue replaces existing code to use the `diamond` operator and adds a > Checkstyle rule. > {code} > -List<JavaPairDStream<String, String>> kafkaStreams = new > ArrayList<JavaPairDStream<String, String>>(numStreams); > +List<JavaPairDStream<String, String>> kafkaStreams = new > ArrayList<>(numStreams); > {code} > {code} > -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges); > +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges); > {code} > *Reference* > https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
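A self-contained Java illustration of the rewrite (the generic types here are chosen for the example only, not taken from the Spark codebase):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiamondDemo {
    public static void main(String[] args) {
        // Pre-diamond (Java 6) style: type arguments repeated on the right.
        List<Map<String, Integer>> verbose =
            new ArrayList<Map<String, Integer>>();

        // Java 7+ diamond operator: the compiler infers the arguments.
        List<Map<String, Integer>> concise = new ArrayList<>();

        verbose.add(new HashMap<>());
        concise.add(new HashMap<>());
        if (verbose.size() != concise.size())
            throw new AssertionError("both styles build the same structure");
        System.out.println("diamond and explicit forms are equivalent");
    }
}
```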
[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13648: -- Priority: Minor (was: Major) Component/s: SQL Summary: org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK (was: org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError) > org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on > IBM JDK > > > Key: SPARK-13648 > URL: https://issues.apache.org/jira/browse/SPARK-13648 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Fails on vendor specific JVMs ( e.g IBM JVM ) >Reporter: Tim Preece >Priority: Minor > > When running the standard Spark unit tests on the IBM Java SDK, the Hive > VersionsSuite fails with the following error. > java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState > when creating Hive client using classpath: .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding
[ https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13606: -- Component/s: PySpark > Error from python worker: /usr/local/bin/python2.7: undefined symbol: > _PyCodec_LookupTextEncoding > --- > > Key: SPARK-13606 > URL: https://issues.apache.org/jira/browse/SPARK-13606 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Avatar Zhang > > Error from python worker: > /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: > undefined symbol: _PyCodec_LookupTextEncoding > PYTHONPATH was: > > /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13679) Pyspark job fails with Oozie
[ https://issues.apache.org/jira/browse/SPARK-13679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13679: -- Target Version/s: (was: 1.6.0) Priority: Minor (was: Major) > Pyspark job fails with Oozie > > > Key: SPARK-13679 > URL: https://issues.apache.org/jira/browse/SPARK-13679 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit, YARN >Affects Versions: 1.6.0 > Environment: Hadoop 2.7.2, Spark 1.6.0 on Yarn, Oozie 4.2.0 > Cluster secured with Kerberos >Reporter: Alexandre Linte >Priority: Minor > > Hello, > I'm trying to run pi.py example in a pyspark job with Oozie. Every try I made > failed for the same reason: key not found: SPARK_HOME. > Note: A scala job works well in the environment with Oozie. > The logs on the executors are: > {noformat} > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/mnt/hd4/hadoop/yarn/local/filecache/145/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/mnt/hd2/hadoop/yarn/local/filecache/155/spark-assembly-1.6.0-hadoop2.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/application/Hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > log4j:ERROR setFile(null,true) call failed. 
> java.io.FileNotFoundException: > /mnt/hd7/hadoop/yarn/log/application_1454673025841_13136/container_1454673025841_13136_01_01 > (Is a directory) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:221) > at java.io.FileOutputStream.(FileOutputStream.java:142) > at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) > at > org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165) > at > org.apache.hadoop.yarn.ContainerLogAppender.activateOptions(ContainerLogAppender.java:55) > at > org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307) > at > org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172) > at > org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104) > at > org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809) > at > org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735) > at > org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483) > at org.apache.log4j.LogManager.(LogManager.java:127) > at > org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64) > at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:285) > at > org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155) > at > org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132) > at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:275) > at > org.apache.hadoop.service.AbstractService.(AbstractService.java:43) > Using properties file: null > Parsed arguments: > master yarn-master > deployMode cluster > executorMemory null > 
executorCores null > totalExecutorCores null > propertiesFile null > driverMemorynull > driverCores null > driverExtraClassPathnull > driverExtraLibraryPath null > driverExtraJavaOptions null > supervise false > queue null > numExecutorsnull > files null > pyFiles null > archivesnull > mainClass null > primaryResource > hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py > namePysparkpi example > childArgs [100] > jarsnull > packagesnull > packagesExclusions null > repositoriesnull > verbose true > Spark properties used, including those specified through > --conf and those from the properties file null: > spark.executorEnv.SPARK_HOME -> /opt/application/Spark/current >
[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13692: -- Description: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. * Remove unused imports in `ColumnarBatch`. * Remove unused fields in `ChunkFetchIntegrationSuite`. * Add `stop()` to prevent resource leak. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583]. was: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. * Remove unused imports in `ColumnarBatch`. * Remove unused fields in `ChunkFetchIntegrationSuite`. * Add `close()` to prevent resource leak. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583]. > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. 
> * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. > * Remove unused imports in `ColumnarBatch`. > * Remove unused fields in `ChunkFetchIntegrationSuite`. > * Add `stop()` to prevent resource leak. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13438) Remove by default dash from output paths
[ https://issues.apache.org/jira/browse/SPARK-13438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13438. --- Resolution: Won't Fix > Remove by default dash from output paths > > > Key: SPARK-13438 > URL: https://issues.apache.org/jira/browse/SPARK-13438 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Peter Ableda > > The current implementation uses the following schema for the saveAsTextFiles, > saveAsObjectFiles, saveAsHadoopFiles methods. > {code} > "prefix-TIME_IN_MS[.suffix]" > {code} > Spark generates the following with a *"/data/timestamp="* prefix. > {code} > /data/timestamp=-1455894364000/_SUCCESS > /data/timestamp=-1455894364000/part-0 > {code} > I would like to remove the dash from the path to be able to create proper > partitions.
[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13688: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue >Priority: Minor > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
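A sketch of how the proposed flag would be used. The property name comes from the proposal above; it was never part of a released Spark, so this is illustrative only, and `my-app.jar` is a placeholder:

```shell
# Today these two settings conflict: setting --num-executors disables
# dynamic allocation (SPARK-9092). Under the proposal, the override flag
# would keep dynamic allocation on and treat 10 as the minimum executor count.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.overrideNumExecutors=true \
  my-app.jar
```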
[jira] [Resolved] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13688. --- Resolution: Won't Fix > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue >Priority: Minor > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13599) Groovy-all ends up in spark-assembly if hive profile set
[ https://issues.apache.org/jira/browse/SPARK-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182299#comment-15182299 ] Sean Owen commented on SPARK-13599: --- [~rxin] do you object to me back-porting this to 1.6? I didn't see a resolution on the private@ thread about this either way. > Groovy-all ends up in spark-assembly if hive profile set > > > Key: SPARK-13599 > URL: https://issues.apache.org/jira/browse/SPARK-13599 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.5.0, 1.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 2.0.0 > > > If you do a build with {{-Phive,hive-thriftserver}} then the contents of > {{org.codehaus.groovy:groovy-all}} get into the spark-assembly.jar > This is bad because > * it makes the JAR bigger > * it makes the build longer > * it's an uber-JAR itself, so it can include things (maybe even conflicting > things) > * It's something else that needs to be kept up to date security-wise
[jira] [Resolved] (SPARK-13496) Optimizing count distinct changes the resulting column name
[ https://issues.apache.org/jira/browse/SPARK-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13496. --- Resolution: Duplicate Provisionally labeling this a duplicate then > Optimizing count distinct changes the resulting column name > --- > > Key: SPARK-13496 > URL: https://issues.apache.org/jira/browse/SPARK-13496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > SPARK-9241 updated the optimizer to rewrite count distinct. That change uses > a count that is no longer distinct because duplicates are eliminated further > down in the plan. This caused the name of the column to change: > {code:title=Spark 1.5.2} > scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a")) > res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint] > == Physical Plan == > TungstenAggregate(key=[], > functions=[(count(a#7),mode=Complete,isDistinct=true)], > output=[COUNT(DISTINCT a)#9L]) > TungstenAggregate(key=[a#7], functions=[], output=[a#7]) > TungstenExchange SinglePartition >TungstenAggregate(key=[a#7], functions=[], output=[a#7]) > LocalTableScan [a#7], [[1]] > {code} > {code:title=Spark 1.6.0} > scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a")) > res0: org.apache.spark.sql.DataFrame = [count(a): bigint] > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else > null),mode=Final,isDistinct=false)], output=[count(a)#31L]) > +- TungstenExchange SinglePartition, None >+- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else > null),mode=Partial,isDistinct=false)], output=[count#39L]) > +- TungstenAggregate(key=[a#36,gid#35], functions=[], > output=[a#36,gid#35]) > +- TungstenExchange hashpartitioning(a#36,gid#35,500), None > +- TungstenAggregate(key=[a#36,gid#35], functions=[], > output=[a#36,gid#35]) >+- Expand [List(a#29, 1)], [a#36,gid#35] > +- LocalTableScan [a#29], [[1]] > {code} > This has broken 
jobs that used the generated name. For example, > {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}. > I think that the previous generated name is correct, even though the plan has > changed. > [~marmbrus], you may want to take a look. It looks like you reviewed > SPARK-9241 and have some context here.
[jira] [Updated] (SPARK-13705) UpdateStateByKey Operation documentation incorrectly refers to StatefulNetworkWordCount
[ https://issues.apache.org/jira/browse/SPARK-13705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13705: -- Target Version/s: (was: 1.6.0) Priority: Trivial (was: Minor) Fix Version/s: (was: 1.6.0) Don't set Fix, Target versions > UpdateStateByKey Operation documentation incorrectly refers to > StatefulNetworkWordCount > --- > > Key: SPARK-13705 > URL: https://issues.apache.org/jira/browse/SPARK-13705 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Rishi >Priority: Trivial > Labels: documentation > > Doc Issue: The Apache Spark Streaming guide elaborates the UpdateStateByKey > Operation and goes on to give an example. However, for the complete code sample it > refers to > {{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala). > StatefulNetworkWordCount.scala has changed to demonstrate the more recent API > mapWithState. > This creates confusion in the document. > Until a more detailed explanation of mapWithState is added, the > reference to StatefulNetworkWordCount.scala should be removed.
[jira] [Resolved] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13230. --- Resolution: Not A Problem OK, considering this not a problem (in Spark) for now, with the noted workaround. > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
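For readers outside a Spark cluster, the per-key merge that the reporter's {{mergeFn}} computes can be sketched with plain {{java.util.Map.merge}}. This is an illustrative Java translation of the merge semantics only, not the reporter's code and not a fix for the Scala {{HashMap.merged}} NullPointerException:

```java
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {
    // Merge m2 into m1 by summing values per key, mirroring what the
    // reporter's HashMap.merged callback computes for each duplicate key.
    static Map<String, Long> merge(Map<String, Long> m1, Map<String, Long> m2) {
        Map<String, Long> out = new HashMap<>(m1);
        m2.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> a = Map.of("A", 1L);
        Map<String, Long> b = Map.of("A", 2L, "B", 3L);
        Map<String, Long> c = Map.of("A", 2L, "C", 4L);
        // Same input as the reporter's Seq(...): expect A=5, B=3, C=4.
        Map<String, Long> result = merge(merge(a, b), c);
        System.out.println(result.get("A")); // 5
        System.out.println(result.get("B")); // 3
        System.out.println(result.get("C")); // 4
    }
}
```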
[jira] [Resolved] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchronous transformation
[ https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13700. --- Resolution: Not A Problem I think this might be best to float on a list first, since I don't think this would be implemented. Some things like this already exist in {{AsyncRDDActions}}. I have the impression they're not quite deprecated, but also something that isn't going to be added to. The semantics are tough to get right. However, you seem to be talking about something else, where you implement N expensive synchronous calls in a map operation. You can already do this yourself with mapPartitions and a thread pool. > Rdd.mapAsync(): Easily mix Spark and asynchronous transformation > - > > Key: SPARK-13700 > URL: https://issues.apache.org/jira/browse/SPARK-13700 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Paulo Costa >Priority: Minor > Labels: async, features, rdd, transform > > Spark is great for synchronous operations. > But sometimes I need to call a database/web server/etc from my transform, and > the Spark pipeline stalls waiting for it. > Avoiding that would be great! > I suggest we add a new method RDD.mapAsync(), which can execute these > operations concurrently, avoiding the bottleneck. > I've written a quick'n'dirty implementation of what I have in mind: > https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3 > What do you think? > If you agree with this feature, I can work on a pull request.
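The mapPartitions-plus-thread-pool workaround mentioned above can be sketched outside Spark. Here a plain {{List}} stands in for a partition's elements and {{slowLookup}} for the blocking external call; both names are made up for illustration, and the method body is what you would put inside {{rdd.mapPartitions}}:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncPartitionSketch {
    // Stand-in for the expensive synchronous call (e.g. a database lookup).
    static String slowLookup(int key) {
        return "value-" + key;
    }

    // Submit every element of the partition to a pool, then block for the
    // results, so the N external calls overlap instead of running serially.
    static List<String> mapPartitionConcurrently(List<Integer> partition, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int key : partition) {
                futures.add(pool.submit(() -> slowLookup(key)));
            }
            // Iterating in submission order preserves the element order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(mapPartitionConcurrently(List.of(1, 2, 3), 4));
        // [value-1, value-2, value-3]
    }
}
```

In real Spark code the pool would be created once per partition inside the {{mapPartitions}} closure, since the closure runs on the executor.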
[jira] [Resolved] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13703. --- Resolution: Not A Problem Yea, for the moment 2.10 support has not been dropped. Dropping it would mean indeed removing files like this. > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor >
[jira] [Resolved] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))
[ https://issues.apache.org/jira/browse/SPARK-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13701. --- Resolution: Not A Problem Yeah, I don't think Spark works on arm64 because jblas does not. The thing I'm not quite sure about is why it doesn't fail earlier and fall back to Java. On both counts, it sounds like something considered not an issue for Spark. However... jblas hasn't been used in Spark for a while. What version is this? I'm closing this, but you can reply. > MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: > org.jblas.NativeBlas.dgemm)) > -- > > Key: SPARK-13701 > URL: https://issues.apache.org/jira/browse/SPARK-13701 > Project: Spark > Issue Type: Bug > Components: MLlib > Environment: Ubuntu 14.04 on aarch64 >Reporter: Santiago M. Mola >Priority: Minor > Labels: arm64, porting > > jblas fails on arm64. > {code} > ALSSuite: > Exception encountered when attempting to run a suite with class name: > org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 > milliseconds) > java.lang.UnsatisfiedLinkError: > org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V > at org.jblas.NativeBlas.dgemm(Native Method) > at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247) > at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781) > at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138) > at > org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13708) Null pointer Exception while starting spark shell
[ https://issues.apache.org/jira/browse/SPARK-13708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13708. --- Resolution: Invalid Please ask questions on u...@spark.apache.org https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You'd have to say a lot more like how you built Spark, how you're running it, etc. > Null pointer Exception while starting spark shell > - > > Key: SPARK-13708 > URL: https://issues.apache.org/jira/browse/SPARK-13708 > Project: Spark > Issue Type: Question > Components: Spark Shell >Affects Versions: 1.6.0 > Environment: windows platform >Reporter: Sowmya Dureddy > > 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 16/03/06 11:49:24 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/03/06 11:49:24 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > 16/03/06 11:49:25 WARN : Your hostname, LAPTOP-B9GV6F82 resolves to a > loopback/non-reachable address: fe80:0:0:0:1cbd:55b:b3e4:8e04%12, but we > couldn't find any external IP address! 
> java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) > at > org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159) > at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108) > at > org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at >
[jira] [Created] (SPARK-13708) Null pointer Exception while starting spark shell
Sowmya Dureddy created SPARK-13708: -- Summary: Null pointer Exception while starting spark shell Key: SPARK-13708 URL: https://issues.apache.org/jira/browse/SPARK-13708 Project: Spark Issue Type: Question Components: Spark Shell Affects Versions: 1.6.0 Environment: windows platform Reporter: Sowmya Dureddy 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 16/03/06 11:49:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 16/03/06 11:49:24 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/03/06 11:49:24 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException 16/03/06 11:49:25 WARN : Your hostname, LAPTOP-B9GV6F82 resolves to a loopback/non-reachable address: fe80:0:0:0:1cbd:55b:b3e4:8e04%12, but we couldn't find any external IP address! java.lang.RuntimeException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) at $iwC$$iwC.(:9) at $iwC.(:18) at (:20) at .(:24) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown
[jira] [Comment Edited] (SPARK-10548) Concurrent execution in SQL does not work
[ https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182265#comment-15182265 ] nicerobot edited comment on SPARK-10548 at 3/6/16 6:47 PM: --- I might be misunderstanding the solution but i'm not clear how the implementation addresses the problem. The issue appears to be that the ThreadLocal property {{"spark.sql.execution.id"}} is not handled properly (cleaned up) in thread-pooled environments. The implemented solution is essentially in {{SparkContext}} {code} // Note: make a clone such that changes in the parent properties aren't reflected in // the those of the children threads, which has confusing semantics (SPARK-10563). SerializationUtils.clone(parent).asInstanceOf[Properties] {code} But from what I can tell, the problem isn't related to parent/child threads. It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained after a thread completes. When that thread is returned to the pool and reused by another execution, the execution id will remain because it's part of the SparkContext's {{localProperties}}. It seems like a {{"spark.sql.execution.id"}} should be local to an execution context instance (a {{QueryExecution}}?), not global to a thread nor specifically a property of a SQLContext/SparkContext. was (Author: nicerobot): I might be misunderstanding the solution but i'm not clear how the implementation addresses the problem. The issue appears to be that the ThreadLocal property "spark.sql.execution.id" is not handled properly in thread-pooled environments. The implemented solution is essentially in {{SparkContext}} {code} // Note: make a clone such that changes in the parent properties aren't reflected in // the those of the children threads, which has confusing semantics (SPARK-10563). SerializationUtils.clone(parent).asInstanceOf[Properties] {code} But from what I can tell, the problem isn't related to parent/child threads. 
It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained after a thread completes. When that thread is returned to the pool and reused by another execution, the execution id will remain because it's part of the SparkContext's {{localProperties}}. It seems like a {{"spark.sql.execution.id"}} should be local to an execution context instance (a {{QueryExecution}}?), not global to a thread nor specifically a property of a SQLContext/SparkContext. > Concurrent execution in SQL does not work > - > > Key: SPARK-10548 > URL: https://issues.apache.org/jira/browse/SPARK-10548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > From the mailing list: > {code} > future { df1.count() } > future { df2.count() } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > === edit === > Simple reproduction: > {code} > (1 to 100).par.foreach { _ => > sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10548) Concurrent execution in SQL does not work
[ https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182265#comment-15182265 ] nicerobot commented on SPARK-10548: --- I might be misunderstanding the solution but i'm not clear how the implementation addresses the problem. The issue appears to be that the ThreadLocal property "spark.sql.execution.id" is not handled properly in thread-pooled environments. The implemented solution is essentially in {{SparkContext}} {code} // Note: make a clone such that changes in the parent properties aren't reflected in // the those of the children threads, which has confusing semantics (SPARK-10563). SerializationUtils.clone(parent).asInstanceOf[Properties] {code} But from what I can tell, the problem isn't related to parent/child threads. It's that {{localProperties}}' {{"spark.sql.execution.id"}} key is retained after a thread completes. When that thread is returned to the pool and reused by another execution, the execution id will remain because it's part of the SparkContext's {{localProperties}}. It seems like a {{"spark.sql.execution.id"}} should be local to an execution context instance (a {{QueryExecution}}?), not global to a thread nor specifically a property of a SQLContext/SparkContext. 
> Concurrent execution in SQL does not work > - > > Key: SPARK-10548 > URL: https://issues.apache.org/jira/browse/SPARK-10548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > From the mailing list: > {code} > future { df1.count() } > future { df2.count() } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > === edit === > Simple reproduction: > {code} > (1 to 100).par.foreach { _ => > sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
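The leak described in the comments can be reproduced without Spark: a {{ThreadLocal}}-held {{Properties}} object, used the way {{SparkContext.localProperties}} is, keeps its execution id when a pooled thread is reused. The class and method names below are illustrative, not Spark's:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakSketch {
    // Mirrors SparkContext.localProperties: one Properties object per thread.
    static final ThreadLocal<Properties> props =
            ThreadLocal.withInitial(Properties::new);

    // Mirrors the start of an execution, without any cleanup step.
    // Returns the previous id; a non-null result means a leak.
    static String beginExecution(String id) {
        String old = props.get().getProperty("spark.sql.execution.id");
        props.get().setProperty("spark.sql.execution.id", id);
        return old;
    }

    public static void main(String[] args) throws Exception {
        // A single-thread pool guarantees the second task reuses the thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        String first = pool.submit(() -> beginExecution("1")).get();
        String second = pool.submit(() -> beginExecution("2")).get();
        pool.shutdown();
        System.out.println(first);   // null: fresh thread, no previous id
        System.out.println(second);  // 1: leaked from the first task
    }
}
```

This is why the "already set" check in {{SQLExecution.withNewExecutionId}} fires for an unrelated query: the stale id travels with the pooled thread, not with the execution.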
[jira] [Commented] (SPARK-8480) Add setName for Dataframe
[ https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182262#comment-15182262 ] Neelesh Srinivas Salian commented on SPARK-8480: Looking into this. Will post a PR for review soon. Thank you. > Add setName for Dataframe > - > > Key: SPARK-8480 > URL: https://issues.apache.org/jira/browse/SPARK-8480 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.4.0 >Reporter: Peter Rudenko >Priority: Minor > > Rdd has a method setName, so in spark UI, it's more easily to understand > what's this cache for. E.g. ("data for LogisticRegression model", etc.). > Would be nice to have the same method for Dataframe, since it displays a > logical schema, in cache page, which could be quite big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13707) Streaming UI tab misleading for window operations
[ https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182259#comment-15182259 ] Jatin Kumar edited comment on SPARK-13707 at 3/6/16 6:29 PM: - The records shown in image are actually the 2 sec batches which share boundary with the 120sec window batch. was (Author: kjatin): The records shown in image are of the 2 sec batch which share boundry with the 120sec window batch. > Streaming UI tab misleading for window operations > - > > Key: SPARK-13707 > URL: https://issues.apache.org/jira/browse/SPARK-13707 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jatin Kumar > Attachments: Screen Shot 2016-03-06 at 11.09.55 pm.png > > > 'Streaming' tab on spark UI is misleading when the job has a window operation > which changes the batch duration from original streaming context batch > duration. > For instance consider: > {code:java} > val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) > val totalVideoImps = streamingContext.sparkContext.accumulator(0, > "TotalVideoImpressions") > val totalImps = streamingContext.sparkContext.accumulator(0, > "TotalImpressions") > val stream = KafkaReader.KafkaDirectStream(streamingContext) > stream.map(KafkaAdLogParser.parseAdLogRecord) > .filter(record => { > totalImps += 1 > KafkaAdLogParser.isVideoRecord(record) > }) > .map(record => { > totalVideoImps += 1 > record.url > }) > .window(Seconds(120)) > .count().foreachRDD((rdd, time) => { > println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) > println("Count: " + rdd.collect()(0)) > println("Total Impressions: " + totalImps.value) > totalImps.setValue(0) > println("Total Video Impressions: " + totalVideoImps.value) > totalVideoImps.setValue(0) > }) > streamingContext.start() > streamingContext.awaitTermination() > {code} > Batch Size before window operation is 2 sec and then after window batches are > of 120 seconds each. 
> -- > Above code printed following for my application whereas the UI showed > different numbers. > {noformat} > Timestamp: 2016-03-06 12:02:56,000 > Count: 362195 > Total Impressions: 16882431 > Total Video Impressions: 362195 > Timestamp: 2016-03-06 12:04:56,000 > Count: 367168 > Total Impressions: 19480293 > Total Video Impressions: 367168 > Timestamp: 2016-03-06 12:06:56,000 > Count: 177711 > Total Impressions: 10196677 > Total Video Impressions: 177711 > {noformat} > whereas the spark UI shows different numbers as attached in the image. Also > if we check the start and end index of kafka partition offsets reported by > subsequent batch entries on UI, they do not result in all overall continuous > range. All numbers are fine if we remove the window operation though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13707) Streaming UI tab misleading for window operations
[ https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jatin Kumar updated SPARK-13707: Attachment: Screen Shot 2016-03-06 at 11.09.55 pm.png The records shown in the image are of the 2-second batches which share a boundary with the 120-second window batch. > Streaming UI tab misleading for window operations > - > > Key: SPARK-13707 > URL: https://issues.apache.org/jira/browse/SPARK-13707 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jatin Kumar > Attachments: Screen Shot 2016-03-06 at 11.09.55 pm.png > > > 'Streaming' tab on spark UI is misleading when the job has a window operation > which changes the batch duration from original streaming context batch > duration. > For instance consider: > {code:java} > val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) > val totalVideoImps = streamingContext.sparkContext.accumulator(0, > "TotalVideoImpressions") > val totalImps = streamingContext.sparkContext.accumulator(0, > "TotalImpressions") > val stream = KafkaReader.KafkaDirectStream(streamingContext) > stream.map(KafkaAdLogParser.parseAdLogRecord) > .filter(record => { > totalImps += 1 > KafkaAdLogParser.isVideoRecord(record) > }) > .map(record => { > totalVideoImps += 1 > record.url > }) > .window(Seconds(120)) > .count().foreachRDD((rdd, time) => { > println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) > println("Count: " + rdd.collect()(0)) > println("Total Impressions: " + totalImps.value) > totalImps.setValue(0) > println("Total Video Impressions: " + totalVideoImps.value) > totalVideoImps.setValue(0) > }) > streamingContext.start() > streamingContext.awaitTermination() > {code} > Batch Size before window operation is 2 sec and then after window batches are > of 120 seconds each. > -- > Above code printed following for my application whereas the UI showed > different numbers. 
> {noformat} > Timestamp: 2016-03-06 12:02:56,000 > Count: 362195 > Total Impressions: 16882431 > Total Video Impressions: 362195 > Timestamp: 2016-03-06 12:04:56,000 > Count: 367168 > Total Impressions: 19480293 > Total Video Impressions: 367168 > Timestamp: 2016-03-06 12:06:56,000 > Count: 177711 > Total Impressions: 10196677 > Total Video Impressions: 177711 > {noformat} > whereas the spark UI shows different numbers as attached in the image. Also > if we check the start and end index of kafka partition offsets reported by > subsequent batch entries on UI, they do not result in all overall continuous > range. All numbers are fine if we remove the window operation though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13707) Streaming UI tab misleading for window operations
[ https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182257#comment-15182257 ] Jatin Kumar commented on SPARK-13707: - Ideally all 2 sec batches should be linked to the final 120 sec batch and one should be able to browse them from UI but I am not aware of the design decisions taken here as it can get quite complex in case of multiple window operations. I would like to work on a fix for this if we can decide on what the behavior should be :) > Streaming UI tab misleading for window operations > - > > Key: SPARK-13707 > URL: https://issues.apache.org/jira/browse/SPARK-13707 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jatin Kumar > > 'Streaming' tab on spark UI is misleading when the job has a window operation > which changes the batch duration from original streaming context batch > duration. > For instance consider: > {code:java} > val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) > val totalVideoImps = streamingContext.sparkContext.accumulator(0, > "TotalVideoImpressions") > val totalImps = streamingContext.sparkContext.accumulator(0, > "TotalImpressions") > val stream = KafkaReader.KafkaDirectStream(streamingContext) > stream.map(KafkaAdLogParser.parseAdLogRecord) > .filter(record => { > totalImps += 1 > KafkaAdLogParser.isVideoRecord(record) > }) > .map(record => { > totalVideoImps += 1 > record.url > }) > .window(Seconds(120)) > .count().foreachRDD((rdd, time) => { > println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) > println("Count: " + rdd.collect()(0)) > println("Total Impressions: " + totalImps.value) > totalImps.setValue(0) > println("Total Video Impressions: " + totalVideoImps.value) > totalVideoImps.setValue(0) > }) > streamingContext.start() > streamingContext.awaitTermination() > {code} > Batch Size before window operation is 2 sec and then after window batches are > of 120 seconds each. 
> -- > Above code printed following for my application whereas the UI showed > different numbers. > {noformat} > Timestamp: 2016-03-06 12:02:56,000 > Count: 362195 > Total Impressions: 16882431 > Total Video Impressions: 362195 > Timestamp: 2016-03-06 12:04:56,000 > Count: 367168 > Total Impressions: 19480293 > Total Video Impressions: 367168 > Timestamp: 2016-03-06 12:06:56,000 > Count: 177711 > Total Impressions: 10196677 > Total Video Impressions: 177711 > {noformat} > whereas the spark UI shows different numbers as attached in the image. Also > if we check the start and end index of kafka partition offsets reported by > subsequent batch entries on UI, they do not result in all overall continuous > range. All numbers are fine if we remove the window operation though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
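As a hedged illustration of the bookkeeping the comment above asks for, the plain-Python sketch below (illustrative only, not Spark code) enumerates which 2-second source batches feed a single 120-second window batch; linking the UI entries amounts to materializing exactly this mapping.

```python
BATCH_SECS = 2      # streaming context batch duration (from the report)
WINDOW_SECS = 120   # window duration

def source_batches(window_end):
    """Return the (start, end) intervals of the source batches that a
    window batch ending at `window_end` (in seconds) aggregates."""
    start = window_end - WINDOW_SECS
    return [(t, t + BATCH_SECS) for t in range(start, window_end, BATCH_SECS)]

batches = source_batches(window_end=240)
print(len(batches))             # 60 two-second batches feed one 120s window
print(batches[0], batches[-1])  # (120, 122) ... (238, 240)
```

With multiple or overlapping windows the mapping becomes many-to-many, which is presumably why the UI design is not obvious.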
[jira] [Updated] (SPARK-13707) Streaming UI tab misleading for window operations
[ https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jatin Kumar updated SPARK-13707: Description: 'Streaming' tab on spark UI is misleading when the job has a window operation which changes the batch duration from original streaming context batch duration. For instance consider: {code:java} val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) val totalVideoImps = streamingContext.sparkContext.accumulator(0, "TotalVideoImpressions") val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions") val stream = KafkaReader.KafkaDirectStream(streamingContext) stream.map(KafkaAdLogParser.parseAdLogRecord) .filter(record => { totalImps += 1 KafkaAdLogParser.isVideoRecord(record) }) .map(record => { totalVideoImps += 1 record.url }) .window(Seconds(120)) .count().foreachRDD((rdd, time) => { println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) println("Count: " + rdd.collect()(0)) println("Total Impressions: " + totalImps.value) totalImps.setValue(0) println("Total Video Impressions: " + totalVideoImps.value) totalVideoImps.setValue(0) }) streamingContext.start() streamingContext.awaitTermination() {code} Batch Size before window operation is 2 sec and then after window batches are of 120 seconds each. -- Above code printed following for my application whereas the UI showed different numbers. {noformat} Timestamp: 2016-03-06 12:02:56,000 Count: 362195 Total Impressions: 16882431 Total Video Impressions: 362195 Timestamp: 2016-03-06 12:04:56,000 Count: 367168 Total Impressions: 19480293 Total Video Impressions: 367168 Timestamp: 2016-03-06 12:06:56,000 Count: 177711 Total Impressions: 10196677 Total Video Impressions: 177711 {noformat} whereas the spark UI shows different numbers as attached in the image. 
Also if we check the start and end index of kafka partition offsets reported by subsequent batch entries on UI, they do not result in all overall continuous range. All numbers are fine if we remove the window operation though. was: 'Streaming' tab on spark UI is misleading when the job has a window operation which changes the batch duration from original streaming context batch duration. For instance consider: {{code:java}} val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) val totalVideoImps = streamingContext.sparkContext.accumulator(0, "TotalVideoImpressions") val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions") val stream = KafkaReader.KafkaDirectStream(streamingContext) stream.map(KafkaAdLogParser.parseAdLogRecord) .filter(record => { totalImps += 1 KafkaAdLogParser.isVideoRecord(record) }) .map(record => { totalVideoImps += 1 record.url }) .window(Seconds(120)) .count().foreachRDD((rdd, time) => { println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) println("Count: " + rdd.collect()(0)) println("Total Impressions: " + totalImps.value) totalImps.setValue(0) println("Total Video Impressions: " + totalVideoImps.value) totalVideoImps.setValue(0) }) streamingContext.start() streamingContext.awaitTermination() {{code}} Batch Size before window operation is 2 sec and then after window batches are of 120 seconds each. -- Above code printed following for my application whereas the UI showed different numbers. {{noformat}} Timestamp: 2016-03-06 12:02:56,000 Count: 362195 Total Impressions: 16882431 Total Video Impressions: 362195 Timestamp: 2016-03-06 12:04:56,000 Count: 367168 Total Impressions: 19480293 Total Video Impressions: 367168 Timestamp: 2016-03-06 12:06:56,000 Count: 177711 Total Impressions: 10196677 Total Video Impressions: 177711 {{noformat}} whereas the spark UI shows different numbers as attached in the image. 
Also if we check the start and end index of kafka partition offsets reported by subsequent batch entries on UI, they do not result in all overall continuous range. All numbers are fine if we remove the window operation though. > Streaming UI tab misleading for window operations > - > > Key: SPARK-13707 > URL: https://issues.apache.org/jira/browse/SPARK-13707 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jatin Kumar > > 'Streaming' tab on spark UI is misleading when the job has a window operation > which changes the batch duration from original streaming context batch > duration. > For instance consider: > {code:java} > val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) > val totalVideoImps = streamingContext.sparkContext.accumulator(0, > "TotalVideoImpressions") > val totalImps = streamingContext.sparkContext.accumulator(0, > "TotalImpressions") > val stream
[jira] [Created] (SPARK-13707) Streaming UI tab misleading for window operations
Jatin Kumar created SPARK-13707: --- Summary: Streaming UI tab misleading for window operations Key: SPARK-13707 URL: https://issues.apache.org/jira/browse/SPARK-13707 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.0 Reporter: Jatin Kumar 'Streaming' tab on spark UI is misleading when the job has a window operation which changes the batch duration from original streaming context batch duration. For instance consider: {{code:java}} val streamingContext = new StreamingContext(sparkConfig, Seconds(2)) val totalVideoImps = streamingContext.sparkContext.accumulator(0, "TotalVideoImpressions") val totalImps = streamingContext.sparkContext.accumulator(0, "TotalImpressions") val stream = KafkaReader.KafkaDirectStream(streamingContext) stream.map(KafkaAdLogParser.parseAdLogRecord) .filter(record => { totalImps += 1 KafkaAdLogParser.isVideoRecord(record) }) .map(record => { totalVideoImps += 1 record.url }) .window(Seconds(120)) .count().foreachRDD((rdd, time) => { println("Timestamp: " + ImpressionAggregator.millsToDate(time.milliseconds)) println("Count: " + rdd.collect()(0)) println("Total Impressions: " + totalImps.value) totalImps.setValue(0) println("Total Video Impressions: " + totalVideoImps.value) totalVideoImps.setValue(0) }) streamingContext.start() streamingContext.awaitTermination() {{code}} Batch Size before window operation is 2 sec and then after window batches are of 120 seconds each. -- Above code printed following for my application whereas the UI showed different numbers. {{noformat}} Timestamp: 2016-03-06 12:02:56,000 Count: 362195 Total Impressions: 16882431 Total Video Impressions: 362195 Timestamp: 2016-03-06 12:04:56,000 Count: 367168 Total Impressions: 19480293 Total Video Impressions: 367168 Timestamp: 2016-03-06 12:06:56,000 Count: 177711 Total Impressions: 10196677 Total Video Impressions: 177711 {{noformat}} whereas the spark UI shows different numbers as attached in the image. 
Also if we check the start and end index of kafka partition offsets reported by subsequent batch entries on UI, they do not result in all overall continuous range. All numbers are fine if we remove the window operation though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'
[ https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13697. Resolution: Fixed Fix Version/s: 1.4.2 1.6.1 1.5.3 2.0.0 Issue resolved by pull request 11535 [https://github.com/apache/spark/pull/11535] > TransformFunctionSerializer.loads doesn't restore the function's module name > if it's '__main__' > --- > > Key: SPARK-13697 > URL: https://issues.apache.org/jira/browse/SPARK-13697 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0, 1.5.3, 1.6.1, 1.4.2 > > > Here is a reproducer > {code} > >>> from pyspark.streaming import StreamingContext > >>> from pyspark.streaming.util import TransformFunction > >>> ssc = StreamingContext(sc, 1) > >>> func = TransformFunction(sc, lambda x: x, sc.serializer) > >>> func.rdd_wrapper(lambda x: x) > TransformFunction( at 0x106ac8b18>) > >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, > >>> func.rdd_wrap_func, func.deserializers))) > >>> func2 = ssc._transformerSerializer.loads(bytes) > >>> print(func2.func.__module__) > None > >>> print(func2.rdd_wrap_func.__module__) > None > >>> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
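The missing step the fix adds, restoring a function's module name after deserialization, can be sketched in plain Python. This is a hypothetical toy serializer, not Spark's actual TransformFunctionSerializer: it simply carries the code object and the module name side by side, so '__main__' is not lost, and reattaches the module on load.

```python
import types

def dumps(func):
    # Hypothetical toy serializer: keep the code object and the module
    # name together, so '__main__' survives the round trip.
    return (func.__code__, func.__module__)

def loads(payload):
    code, module = payload
    restored = types.FunctionType(code, globals())
    restored.__module__ = module  # the restore step the bug omitted
    return restored

f = lambda x: x + 1
g = loads(dumps(f))
print(g(2), g.__module__ == f.__module__)  # prints: 3 True
```

Without the explicit reassignment, the restored function's `__module__` depends on the deserializing context, which is how the reproducer above ends up printing `None`.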
[jira] [Issue Comment Deleted] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns
[ https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh Gupta updated SPARK-12313: Comment: was deleted (was: [~liancheng] Should this issue occur after your PR https://github.com/apache/spark/pull/7492 ?) > getPartitionsByFilter doesnt handle predicates on all / multiple Partition > Columns > -- > > Key: SPARK-12313 > URL: https://issues.apache.org/jira/browse/SPARK-12313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Gobinathan SP >Assignee: Harsh Gupta >Priority: Critical > > When enabled spark.sql.hive.metastorePartitionPruning, the > getPartitionsByFilter is used > For a table partitioned by p1 and p2, when triggered hc.sql("select col > from tabl1 where p1='p1V' and p2= 'p2V' ") > The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' > and col2= 'p2V' . The same is passed to the getPartitionsByFilter method as > filter string. > On these cases the partitions are not returned from Hive's > getPartitionsByFilter method. As a result, for the sql, the number of > returned rows is always zero. > However, filter on a single column always works. Probalbly it doesn't come > through this route > I'm using Oracle for Metstore V0.13.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns
[ https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182221#comment-15182221 ] Harsh Gupta commented on SPARK-12313: - Hi [~lian cheng] . Should this issue occur even after your PR https://github.com/apache/spark/pull/7492 ? > getPartitionsByFilter doesnt handle predicates on all / multiple Partition > Columns > -- > > Key: SPARK-12313 > URL: https://issues.apache.org/jira/browse/SPARK-12313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Gobinathan SP >Assignee: Harsh Gupta >Priority: Critical > > When enabled spark.sql.hive.metastorePartitionPruning, the > getPartitionsByFilter is used > For a table partitioned by p1 and p2, when triggered hc.sql("select col > from tabl1 where p1='p1V' and p2= 'p2V' ") > The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' > and col2= 'p2V' . The same is passed to the getPartitionsByFilter method as > filter string. > On these cases the partitions are not returned from Hive's > getPartitionsByFilter method. As a result, for the sql, the number of > returned rows is always zero. > However, filter on a single column always works. Probalbly it doesn't come > through this route > I'm using Oracle for Metstore V0.13.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns
[ https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182220#comment-15182220 ] Harsh Gupta commented on SPARK-12313: - [~liancheng] Should this issue occur after your PR https://github.com/apache/spark/pull/7492 ? > getPartitionsByFilter doesnt handle predicates on all / multiple Partition > Columns > -- > > Key: SPARK-12313 > URL: https://issues.apache.org/jira/browse/SPARK-12313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Gobinathan SP >Assignee: Harsh Gupta >Priority: Critical > > When enabled spark.sql.hive.metastorePartitionPruning, the > getPartitionsByFilter is used > For a table partitioned by p1 and p2, when triggered hc.sql("select col > from tabl1 where p1='p1V' and p2= 'p2V' ") > The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' > and col2= 'p2V' . The same is passed to the getPartitionsByFilter method as > filter string. > On these cases the partitions are not returned from Hive's > getPartitionsByFilter method. As a result, for the sql, the number of > returned rows is always zero. > However, filter on a single column always works. Probalbly it doesn't come > through this route > I'm using Oracle for Metstore V0.13.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
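For context on what the predicate conversion produces, here is a hedged sketch (a hypothetical helper, not Spark's HiveShim) of rendering equality predicates on multiple partition columns into the single AND-joined filter string that a metastore's partition-by-filter call expects:

```python
def to_hive_filter(predicates):
    """Hypothetical sketch: render (column, value) equality predicates on
    partition columns as one AND-joined filter string, e.g. for the
    reported query on p1 and p2."""
    return " and ".join('{} = "{}"'.format(col, val) for col, val in predicates)

print(to_hive_filter([("p1", "p1V"), ("p2", "p2V")]))
# prints: p1 = "p1V" and p2 = "p2V"
```

The bug report suggests the multi-column form of this string is not matched by the metastore, while the single-column form is; the exact quoting and operator support depend on the Hive version.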
[jira] [Commented] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))
[ https://issues.apache.org/jira/browse/SPARK-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182214#comment-15182214 ] Santiago M. Mola commented on SPARK-13701: -- Installed gfortran. Now the failure moves to NNLSSuite, while ALSSuite succeeds. {code} [info] NNLSSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.mllib.optimization.NNLSSuite *** ABORTED *** (68 milliseconds) [info] java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V [info] at org.jblas.NativeBlas.dgemm(Native Method) [info] at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247) [info] at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781) [info] at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138) [info] at org.apache.spark.mllib.optimization.NNLSSuite.genOnesData(NNLSSuite.scala:33) [info] at org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(NNLSSuite.scala:56) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166) [info] at org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply$mcV$sp(NNLSSuite.scala:55) [info] at org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply(NNLSSuite.scala:45) [info] at org.apache.spark.mllib.optimization.NNLSSuite$$anonfun$2.apply(NNLSSuite.scala:45) {code} > MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: > org.jblas.NativeBlas.dgemm)) > -- > > Key: SPARK-13701 > URL: https://issues.apache.org/jira/browse/SPARK-13701 > Project: Spark > Issue Type: Bug > Components: MLlib > Environment: Ubuntu 14.04 on aarch64 >Reporter: Santiago M. Mola >Priority: Minor > Labels: arm64, porting > > jblas fails on arm64. 
> {code} > ALSSuite: > Exception encountered when attempting to run a suite with class name: > org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 > milliseconds) > java.lang.UnsatisfiedLinkError: > org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V > at org.jblas.NativeBlas.dgemm(Native Method) > at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247) > at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781) > at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138) > at > org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13706) Python Example for Train Validation Split Missing
[ https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy updated SPARK-13706: --- Description: An example of how to use TrainValidationSplit in pyspark needs to be added. Should be consistent with the current examples. I'll submit a PR. (was: And example of how to use TrainValidationSplit in pyspark needs to be added. Should be consistent with the current examples. I'll submit a PR.) > Python Example for Train Validation Split Missing > - > > Key: SPARK-13706 > URL: https://issues.apache.org/jira/browse/SPARK-13706 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Reporter: Jeremy >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > An example of how to use TrainValidationSplit in pyspark needs to be added. > Should be consistent with the current examples. I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13706) Python Example for Train Validation Split Missing
[ https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182212#comment-15182212 ] Apache Spark commented on SPARK-13706: -- User 'JeremyNixon' has created a pull request for this issue: https://github.com/apache/spark/pull/11547 > Python Example for Train Validation Split Missing > - > > Key: SPARK-13706 > URL: https://issues.apache.org/jira/browse/SPARK-13706 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Reporter: Jeremy >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > And example of how to use TrainValidationSplit in pyspark needs to be added. > Should be consistent with the current examples. I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13706) Python Example for Train Validation Split Missing
[ https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13706: Assignee: (was: Apache Spark) > Python Example for Train Validation Split Missing > - > > Key: SPARK-13706 > URL: https://issues.apache.org/jira/browse/SPARK-13706 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Reporter: Jeremy >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > And example of how to use TrainValidationSplit in pyspark needs to be added. > Should be consistent with the current examples. I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13706) Python Example for Train Validation Split Missing
[ https://issues.apache.org/jira/browse/SPARK-13706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13706: Assignee: Apache Spark > Python Example for Train Validation Split Missing > - > > Key: SPARK-13706 > URL: https://issues.apache.org/jira/browse/SPARK-13706 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Reporter: Jeremy >Assignee: Apache Spark >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > And example of how to use TrainValidationSplit in pyspark needs to be added. > Should be consistent with the current examples. I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13706) Python Example for Train Validation Split Missing
Jeremy created SPARK-13706: -- Summary: Python Example for Train Validation Split Missing Key: SPARK-13706 URL: https://issues.apache.org/jira/browse/SPARK-13706 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Reporter: Jeremy Priority: Minor And example of how to use TrainValidationSplit in pyspark needs to be added. Should be consistent with the current examples. I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
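Until the example PR lands, the idea behind TrainValidationSplit can be sketched in plain Python (a conceptual stand-in, not the pyspark.ml API): make one random split of the data into a training set and a validation set according to a train ratio.

```python
import random

def train_validation_split(rows, train_ratio=0.75, seed=42):
    """Conceptual sketch of what TrainValidationSplit does: a single
    random split into training and validation sets (not the pyspark API)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, validation = train_validation_split(list(range(100)))
print(len(train), len(validation))  # prints: 75 25
```

Unlike cross-validation, each candidate model is fit once on the training portion and scored once on the validation portion, which is cheaper but higher-variance.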