[jira] [Updated] (SPARK-21545) pyspark2
[ https://issues.apache.org/jira/browse/SPARK-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gumpcheng updated SPARK-21545:
--
Description:
I installed Spark 2.2 with CDH 5.12 following the official steps. The info shown on Cloudera Manager is okay, but I failed to initialize pyspark2.
My environment: Python 3.6.1, JDK 1.8, CDH 5.12.
This problem has blocked me for several days and I have found no way to solve it. Can anyone help? Thank you very much!

[hdfs@Master /data/soft/spark2.2]$ pyspark2
Python 3.6.1 (default, Jul 27 2017, 11:07:01)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/27 12:02:09 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
17/07/27 12:02:09 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/07/27 12:02:09 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
        at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:141)
        at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1485)
        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90)
        at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1937)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1936)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:587)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/pyspark/shell.py:52: UserWarning: Fall back to non-hive support because failing to access HiveConf, please make sure you build spark with hive
  warnings.warn("Fall back to non-hive support because failing to access HiveConf, "
17/07/27 12:02:09 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243).
The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
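For "Yarn application has already ended" errors like the one above, the usual first step is to inspect the failed application's logs via the standard `yarn logs` CLI. A tiny illustrative helper (the function name and application ID are hypothetical, not part of Spark) that builds that command:

```python
import shlex

def yarn_logs_command(application_id):
    """Build the standard YARN CLI invocation for fetching
    an application's aggregated logs (AM container included)."""
    return ["yarn", "logs", "-applicationId", application_id]

# Example: application ID as shown in the ResourceManager UI (made up here).
cmd = yarn_logs_command("application_1501123456789_0001")
print(shlex.join(cmd))
```

The resulting command can be passed to `subprocess.run` or pasted into a shell on a node with YARN client configuration.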
[jira] [Created] (SPARK-21545) pyspark2
gumpcheng created SPARK-21545:
-
Summary: pyspark2
Key: SPARK-21545
URL: https://issues.apache.org/jira/browse/SPARK-21545
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.2.0
Environment: Spark 2.2 with CDH 5.12, Python 3.6.1, JDK 1.8_b31
Reporter: gumpcheng
Fix For: 2.2.0

[hdfs@Master /data/soft/spark2.2]$ pyspark2
... (same "Error initializing SparkContext" session and stack traces as quoted in the update notification above)
[jira] [Resolved] (SPARK-21530) Update description of spark.shuffle.maxChunksBeingTransferred
[ https://issues.apache.org/jira/browse/SPARK-21530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21530.
-
Resolution: Fixed
Fix Version/s: 2.3.0
Issue resolved by pull request 18735 [https://github.com/apache/spark/pull/18735]

> Update description of spark.shuffle.maxChunksBeingTransferred
> Key: SPARK-21530
> URL: https://issues.apache.org/jira/browse/SPARK-21530
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
> Fix For: 2.3.0
>
> We should expand the description of spark.shuffle.maxChunksBeingTransferred to include what happens when the max is hit. In that case the server actually closes incoming connections, so if the retry logic on the reducer side doesn't kick in, the fetch can end up failing.
> The way it's currently worded doesn't tell users that bad things can happen when the max is hit.
> Introduced with https://github.com/apache/spark/pull/18388/files SPARK-21175

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
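The behavior the description above asks to document can be sketched as a toy Python model (illustrative names only, not Spark's actual implementation): once the number of in-flight chunks reaches the cap, new transfer requests are rejected (the real server closes the incoming connection), so the fetcher only succeeds if its retry logic fires after a slot frees up.

```python
class ChunkTransferGuard:
    """Toy model of a cap like spark.shuffle.maxChunksBeingTransferred."""

    def __init__(self, max_chunks):
        self.max_chunks = max_chunks
        self.in_flight = 0

    def try_start_transfer(self):
        if self.in_flight >= self.max_chunks:
            return False  # rejected: caller must retry later
        self.in_flight += 1
        return True

    def finish_transfer(self):
        self.in_flight -= 1


guard = ChunkTransferGuard(max_chunks=2)
assert guard.try_start_transfer()
assert guard.try_start_transfer()
assert not guard.try_start_transfer()  # cap hit: request rejected
guard.finish_transfer()
assert guard.try_start_transfer()      # a retry succeeds once a slot frees up
```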
[jira] [Assigned] (SPARK-21530) Update description of spark.shuffle.maxChunksBeingTransferred
[ https://issues.apache.org/jira/browse/SPARK-21530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21530:
---
Assignee: jin xing

> Update description of spark.shuffle.maxChunksBeingTransferred
> Key: SPARK-21530
> (quoted description omitted; identical to the resolved notification above)

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21400) Spark shouldn't ignore user defined output committer in append mode
[ https://issues.apache.org/jira/browse/SPARK-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski closed SPARK-21400.
-
Resolution: Fixed
Fix Version/s: 2.3.0

> Spark shouldn't ignore user defined output committer in append mode
> Key: SPARK-21400
> URL: https://issues.apache.org/jira/browse/SPARK-21400
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Robert Kruszewski
> Fix For: 2.3.0
>
> In https://issues.apache.org/jira/browse/SPARK-8578 we decided to override user defined output committers in append mode. The reasoning was that there are some output committers that can lead to correctness issues. Since then we have removed DirectParquetOutputCommitter (the biggest known offender) from the codebase and rely on the default implementations.
> I believe that we shouldn't be restricting this anymore, and users should understand that if they override this configuration they have tested their committer for correctness. This unblocks using more sophisticated and performant output committers without the need to override file format implementations.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
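The policy change requested above can be illustrated with a toy Python sketch (hypothetical function and flag names, not Spark's actual code): under the old SPARK-8578 behavior, append mode always falls back to the default committer; under the proposed behavior, a user-supplied committer is honored in every save mode.

```python
DEFAULT_COMMITTER = "FileOutputCommitter"

def pick_committer(save_mode, user_committer, honor_user_in_append):
    """Return the committer class name that would be used."""
    if user_committer is None:
        return DEFAULT_COMMITTER
    if save_mode == "append" and not honor_user_in_append:
        # Old behavior (SPARK-8578): ignore the user's committer in append mode.
        return DEFAULT_COMMITTER
    return user_committer

# Old behavior: the user committer is silently ignored in append mode.
assert pick_committer("append", "MyCommitter", honor_user_in_append=False) == "FileOutputCommitter"
# Proposed behavior: the user committer is honored.
assert pick_committer("append", "MyCommitter", honor_user_in_append=True) == "MyCommitter"
# Non-append modes honored the user committer either way.
assert pick_committer("overwrite", "MyCommitter", honor_user_in_append=False) == "MyCommitter"
```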
[jira] [Created] (SPARK-21544) Test jar of some module should not install or deploy twice
zhoukang created SPARK-21544:
Summary: Test jar of some module should not install or deploy twice
Key: SPARK-21544
URL: https://issues.apache.org/jira/browse/SPARK-21544
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 2.1.0, 1.6.1
Reporter: zhoukang

For the modules below:
common/network-common
streaming
sql/core
sql/catalyst
the tests.jar will be installed or deployed twice, like:
{code:java}
[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
[INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged
{code}
The reason is that the test jar is attached twice:
{code:java}
[DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
[DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0-mdh2.1.0.1-SNAPSHOT]
{code}
When executing 'mvn deploy' to Nexus during a release, the build will fail, since artifacts in a release Nexus repository cannot be overwritten.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21400) Spark shouldn't ignore user defined output committer in append mode
[ https://issues.apache.org/jira/browse/SPARK-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102638#comment-16102638 ] Robert Kruszewski commented on SPARK-21400:
---
Fixed in [https://github.com/apache/spark/pull/18689]

> Spark shouldn't ignore user defined output committer in append mode
> Key: SPARK-21400
> (quoted description omitted; identical to the closed notification above)

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21543) Should not count executor initialize failed towards task failures
zhoukang created SPARK-21543:
Summary: Should not count executor initialize failed towards task failures
Key: SPARK-21543
URL: https://issues.apache.org/jira/browse/SPARK-21543
Project: Spark
Issue Type: Improvement
Components: YARN
Affects Versions: 2.1.0, 1.6.1
Reporter: zhoukang

Currently, when an executor fails to initialize and exits with error code 1, the failure counts toward task failures. I think executor initialization failures should not be counted toward task failures.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
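The distinction being requested above can be sketched as a toy Python model (the class and parameter names are illustrative, not Spark's scheduler code): failures caused by an executor that never initialized are not charged against the task's retry budget, while genuine task failures are.

```python
class TaskFailureTracker:
    """Toy model of a per-task failure counter with the proposed policy:
    executor-initialization failures do not count toward the retry limit."""

    def __init__(self, max_task_failures=4):
        self.max_task_failures = max_task_failures
        self.failures = 0

    def record_failure(self, executor_init_failed):
        if executor_init_failed:
            return  # proposed: don't charge the task for an executor that never started
        self.failures += 1

    def should_abort(self):
        return self.failures >= self.max_task_failures


tracker = TaskFailureTracker(max_task_failures=2)
tracker.record_failure(executor_init_failed=True)   # not counted
tracker.record_failure(executor_init_failed=True)   # not counted
assert not tracker.should_abort()
tracker.record_failure(executor_init_failed=False)  # real task failure
tracker.record_failure(executor_init_failed=False)
assert tracker.should_abort()
```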
[jira] [Comment Edited] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101686#comment-16101686 ] zhoukang edited comment on SPARK-21539 at 7/27/17 2:06 AM:
---
I want to report the node blacklist to YARN in real time. Does this make sense, [~squito]?
was (Author: cane): I am working on this.

> Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.1, 2.1.0
> Reporter: zhoukang
>
> For Spark on YARN:
> Right now, when a TaskSet cannot run on any node or host, blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted.
> However, if dynamic allocation is enabled, we should wait for YARN to allocate a new NodeManager in order to execute the job successfully.
> How to reproduce:
> 1. Set up a YARN cluster with 5 nodes, and give one node (node1) much larger CPU and memory, so that YARN keeps launching containers on it even when it is blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer, adding the logic marked in red:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
>     {color:red}if (shuffleServerId.host.equals("node1's address")) {
>       throw new Exception
>     }{color}
>     // Synchronous and will throw an exception if we cannot connect.
>     shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>       shuffleServerId.host, shuffleServerId.port, shuffleServerId.executorId, shuffleConfig)
>     return
>   } catch {
>     case e: Exception if i < MAX_ATTEMPTS =>
>       logError(s"Failed to connect to external shuffle server, will retry ${MAX_ATTEMPTS - i}" +
>         s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>       Thread.sleep(SLEEP_TIME_SECS * 1000)
>     case NonFatal(e) =>
>       throw new SparkException("Unable to register with external shuffle server due to : " +
>         e.getMessage, e)
>   }
> }
> {code}
> 3. Enable the external shuffle service for YARN.
> YARN will then always launch executors on node1, but they fail because shuffle service registration cannot succeed, and the job is aborted.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
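The Scala retry loop quoted in the reproduction steps above can be mirrored by a small Python analogue (names like register_with_retries are illustrative; real Spark waits 5 seconds between attempts): each attempt tries to register with the external shuffle service, sleeps on failure, and gives up only after the last attempt.

```python
import time

def register_with_retries(register_fn, max_attempts=3, sleep_secs=0):
    """Try register_fn up to max_attempts times, sleeping between
    failed attempts; raise only after the last attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            register_fn()
            return attempt  # success: report which attempt worked
        except Exception as exc:
            if attempt < max_attempts:
                time.sleep(sleep_secs)  # real code waits SLEEP_TIME_SECS = 5
            else:
                raise RuntimeError(
                    "Unable to register with external shuffle server") from exc

# Simulate a shuffle server that becomes reachable on the third attempt.
calls = {"n": 0}
def flaky_register():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("shuffle server not reachable yet")

assert register_with_retries(flaky_register, max_attempts=3) == 3
```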
[jira] [Updated] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21539:
-
Affects Version/s: (was: 2.2.0)

> Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
> (quoted description omitted; identical to the comment notification above)

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reopened SPARK-21539:
--

> Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
> (quoted description omitted; identical to the comment notification above)

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21533) "configure(...)" method not called when using Hive Generic UDFs
[ https://issues.apache.org/jira/browse/SPARK-21533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102598#comment-16102598 ] Feng Zhu commented on SPARK-21533:
--
Could you post an example?

> "configure(...)" method not called when using Hive Generic UDFs
> Key: SPARK-21533
> URL: https://issues.apache.org/jira/browse/SPARK-21533
> Project: Spark
> Issue Type: Bug
> Components: Java API, SQL
> Affects Versions: 2.1.1
> Reporter: Dean Gurvitz
> Priority: Minor
>
> Using Spark 2.1.1 and the Java API, when executing a Hive Generic UDF through the Spark SQL API, the configure() method in it is not called prior to the initialize/evaluate methods, as expected.
> The configure method receives a MapredContext object. It is possible to construct a version of such an object adjusted to Spark, and therefore configure should be called to enable smooth execution of all Hive Generic UDFs.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
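The lifecycle the reporter expects can be sketched with pure-Python stand-ins (MapredContext, GenericUDF, and run_udf here are toy mocks, not Hive's or Spark's actual classes): configure() must run before initialize()/evaluate(), and the issue is that Spark currently skips that first call.

```python
class MapredContext:
    """Toy stand-in for Hive's MapredContext (illustrative only)."""
    def __init__(self, job_conf):
        self.job_conf = job_conf

class GenericUDF:
    """Toy Hive-style generic UDF with the expected lifecycle:
    configure() -> initialize() -> evaluate()."""
    def __init__(self):
        self.configured = False

    def configure(self, context):
        self.configured = True  # e.g. read settings from context.job_conf

    def initialize(self, arg_types):
        # Fails if the engine (like Spark today) skipped configure().
        assert self.configured, "configure() must be called first"

    def evaluate(self, args):
        return sum(args)

def run_udf(udf, args):
    """The call order the reporter says the engine should follow."""
    udf.configure(MapredContext(job_conf={}))  # the step Spark currently skips
    udf.initialize([type(a) for a in args])
    return udf.evaluate(args)

assert run_udf(GenericUDF(), [1, 2, 3]) == 6
```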
[jira] [Closed] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang closed SPARK-21539. Resolution: Not A Problem > Job should not be aborted when dynamic allocation is enabled or > spark.executor.instances larger then current allocated number by yarn > - > > Key: SPARK-21539 > URL: https://issues.apache.org/jira/browse/SPARK-21539 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.1.0, 2.2.0 >Reporter: zhoukang > > For spark on yarn. > Right now, when TaskSet can not run on any node or host.Which means > blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted. > However, if dynamic allocation is enabled, we should wait for yarn to > allocate new nodemanager in order to execute job successfully. > How to reproduce? > 1、Set up a yarn cluster with 5 nodes.And assign a node1 with much larger cpu > core and memory,which can let yarn launch container on this node even it is > blacklisted by TaskScheduler. > 2、modify BlockManager#registerWithExternalShuffleServer > {code:java} > logInfo("Registering executor with local external shuffle service.") > val shuffleConfig = new ExecutorShuffleInfo( > diskBlockManager.localDirs.map(_.toString), > diskBlockManager.subDirsPerLocalDir, > shuffleManager.getClass.getName) > val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) > val SLEEP_TIME_SECS = 5 > for (i <- 1 to MAX_ATTEMPTS) { > try { > {color:red}if (shuffleId.host.equals("node1's address")) { > throw new Exception > }{color} > // Synchronous and will throw an exception if we cannot connect. 
> > shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( > shuffleServerId.host, shuffleServerId.port, > shuffleServerId.executorId, shuffleConfig) > return > } catch { > case e: Exception if i < MAX_ATTEMPTS => > logError(s"Failed to connect to external shuffle server, will retry > ${MAX_ATTEMPTS - i}" > + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) > Thread.sleep(SLEEP_TIME_SECS * 1000) > case NonFatal(e) => > throw new SparkException("Unable to register with external shuffle > server due to : " + > e.getMessage, e) > } > } > {code} > Add the logic highlighted in red. > 3. Set spark.shuffle.service.enabled to true and open the shuffle service for yarn. > Then yarn will always launch executors on node1, but they fail since the shuffle > service cannot register successfully. > Then the job will be aborted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21540) add spark.sql.functions.map_keys and spark.sql.functions.map_values
[ https://issues.apache.org/jira/browse/SPARK-21540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21540. -- Resolution: Duplicate I guess this was added in SPARK-19975. > add spark.sql.functions.map_keys and spark.sql.functions.map_values > --- > > Key: SPARK-21540 > URL: https://issues.apache.org/jira/browse/SPARK-21540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: yu peng > > For dataframe/sql we support MapType, but unlike ArrayType, we cannot simply > explode to unpack it. > getItem is just not powerful enough for manipulating MapType columns. > df.select(map_keys('$map')) and df.select(map_values('$map')) would be really > useful -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
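For context, the behavior requested above — per-row key and value extraction from a MapType column — can be sketched in plain Python, with the column represented as a list of dicts (one per row). This only illustrates the intended semantics; it is not the Spark API itself:

```python
# Plain-Python model of a MapType column: one dict per row.
def map_keys(column):
    """Per row, return the list of keys of the map value."""
    return [list(row.keys()) for row in column]

def map_values(column):
    """Per row, return the list of values of the map value."""
    return [list(row.values()) for row in column]

rows = [{"a": 1, "b": 2}, {"c": 3}]
keys = map_keys(rows)
vals = map_values(rows)
```

Each function returns an array-typed column, which is what then makes `explode` and the other array helpers applicable.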
[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21542: -- Description: Currently, there is no way to easily persist Json-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence. This task involves: - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, DefaultParamsReader, and DefaultParamsWriter in pyspark. was: Currnetly, there is no way to easily persist Json-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence. This task involves: - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, DefaultParamsReader, and DefaultParamsWriter in pyspark. > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. 
In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
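As an illustration of the task described above, here is a minimal plain-Python sketch of persisting a stage's JSON-serializable params without going through Java. The class names mirror the JIRA's naming but are simplified stand-ins, and `ShiftTransformer` is a hypothetical Python-only stage, not part of pyspark:

```python
import json
import os
import tempfile


class DefaultParamsWriter:
    """Sketch: persist a stage's JSON-serializable params as a metadata file."""
    def __init__(self, instance):
        self.instance = instance

    def save(self, path):
        os.makedirs(path, exist_ok=True)
        metadata = {
            "class": type(self.instance).__name__,
            "paramMap": self.instance.params,  # assumed JSON-serializable
        }
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump(metadata, f)


class DefaultParamsReader:
    """Sketch: rebuild a stage of a known class from saved metadata."""
    def __init__(self, cls):
        self.cls = cls

    def load(self, path):
        with open(os.path.join(path, "metadata.json")) as f:
            metadata = json.load(f)
        return self.cls(**metadata["paramMap"])


class ShiftTransformer:
    """Hypothetical Python-only stage whose params are JSON-serializable."""
    def __init__(self, shift=0):
        self.params = {"shift": shift}

    def transform(self, values):
        return [v + self.params["shift"] for v in values]


# Round-trip: write params to JSON metadata, then rebuild the stage from it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shift_stage")
    DefaultParamsWriter(ShiftTransformer(shift=2)).save(path)
    restored = DefaultParamsReader(ShiftTransformer).load(path)
    result = restored.transform([1, 2, 3])
```

The point of the sketch is that nothing here touches the JVM: a stage whose state is a JSON-serializable param map can round-trip entirely on the Python side.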
[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21542: -- Component/s: ML > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Saini updated SPARK-21542: --- Component/s: (was: ML) > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21542) Helper functions for custom Python Persistence
Ajay Saini created SPARK-21542: -- Summary: Helper functions for custom Python Persistence Key: SPARK-21542 URL: https://issues.apache.org/jira/browse/SPARK-21542 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Ajay Saini Currently, there is no way to easily persist Json-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence. This task involves: - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, DefaultParamsReader, and DefaultParamsWriter in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102420#comment-16102420 ] Bryan Cutler commented on SPARK-21190: -- Hi [~icexelloss], yes I think there is definitely some basic framework needed to enable vectorized UDFs in Python that could be done before any API improvements. Like you pointed out in (3) the PythonRDD and pyspark.worker will need to be more flexible to handle a different udf, as well as some work on the pyspark.serializer to produce/consume vector data. I was hoping [#18659|https://github.com/apache/spark/pull/18659] could serve as this basis. I plan on updating it to use the {{ArrowColumnVector}} and {{ArrowWriter}} from [~ueshin] when that gets merged. > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. 
> > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > A few things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > 3. How do we handle null values, since Pandas doesn't have the concept of > nulls? > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. 
> """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {code} > > > > *Optional Design Sketch* > I’m more concerned about getting proper feedback for API design. The > implementation should be pretty straightforward and is not a huge concern at > this point. We can leverage the same implementation for faster
[jira] [Updated] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext
[ https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-21541: - Description: If you run a spark job without creating the SparkSession or SparkContext, the spark job logs say it succeeded but yarn says it fails and retries 3 times. Also, since the Application Master unregisters with the Resource Manager and exits successfully, it deletes the spark staging directory, so when yarn makes subsequent retries, it fails to find the staging directory and thus the retries fail. *Steps:* 1. For example, run a pyspark job without creating SparkSession or SparkContext. *Example:* import sys from random import random from operator import add from pyspark import SparkContext if __name__ == "__main__": print("hello world") 2. Spark will mark it as FAILED. Go to the UI and check the container logs. 3. You will see the following information in the logs: spark: 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED But yarn logs will show: 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change from FINAL_SAVING to FAILED was: If you run a spark job without creating the SparkSession or SparkContext, the spark job logs say it succeeded but yarn says it fails and retries 3 times. *Steps:* 1. For example, run a pyspark job without creating SparkSession or SparkContext. *Example:* import sys from random import random from operator import add from pyspark import SparkContext if __name__ == "__main__": print("hello world") 2. Spark will mark it as FAILED. Go to the UI and check the container logs. 3. 
You will see the following information in the logs: spark: 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED But yarn logs will show: 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change from FINAL_SAVING to FAILED > Spark Logs show incorrect job status for a job that does not create > SparkContext > > > Key: SPARK-21541 > URL: https://issues.apache.org/jira/browse/SPARK-21541 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > If you run a spark job without creating the SparkSession or SparkContext, the > spark job logs say it succeeded but yarn says it fails and retries 3 times. > Also, since the Application Master unregisters with the Resource Manager and exits > successfully, it deletes the spark staging directory, so when yarn makes > subsequent retries, it fails to find the staging directory and thus the > retries fail. > *Steps:* > 1. For example, run a pyspark job without creating SparkSession or > SparkContext. > *Example:* > import sys > from random import random > from operator import add > from pyspark import SparkContext > if __name__ == "__main__": > print("hello world") > 2. Spark will mark it as FAILED. Go to the UI and check the container logs. > 3. 
You will see the following information in the logs: > spark: > 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, > exitCode: 0 > 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > But yarn logs will show: > 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO > attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State > change from FINAL_SAVING to FAILED -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext
[ https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102308#comment-16102308 ] Parth Gandhi commented on SPARK-21541: -- Currently working on the fix, will file a pull request as soon as it is done. > Spark Logs show incorrect job status for a job that does not create > SparkContext > > > Key: SPARK-21541 > URL: https://issues.apache.org/jira/browse/SPARK-21541 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > If you run a spark job without creating the SparkSession or SparkContext, the > spark job logs say it succeeded but yarn says it fails and retries 3 times. > *Steps:* > 1. For example, run a pyspark job without creating SparkSession or > SparkContext. > *Example:* > import sys > from random import random > from operator import add > from pyspark import SparkContext > if __name__ == "__main__": > print("hello world") > 2. Spark will mark it as FAILED. Go to the UI and check the container logs. > 3. You will see the following information in the logs: > spark: > 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, > exitCode: 0 > 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > But yarn logs will show: > 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO > attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State > change from FINAL_SAVING to FAILED -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext
Parth Gandhi created SPARK-21541: Summary: Spark Logs show incorrect job status for a job that does not create SparkContext Key: SPARK-21541 URL: https://issues.apache.org/jira/browse/SPARK-21541 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0 Reporter: Parth Gandhi Priority: Minor If you run a spark job without creating the SparkSession or SparkContext, the spark job logs say it succeeded but yarn says it fails and retries 3 times. *Steps:* 1. For example, run a pyspark job without creating SparkSession or SparkContext. *Example:* import sys from random import random from operator import add from pyspark import SparkContext if __name__ == "__main__": print("hello world") 2. Spark will mark it as FAILED. Go to the UI and check the container logs. 3. You will see the following information in the logs: spark: 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED But yarn logs will show: 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change from FINAL_SAVING to FAILED -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20418) multi-label classification support
[ https://issues.apache.org/jira/browse/SPARK-20418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102280#comment-16102280 ] Weichen Xu commented on SPARK-20418: I will work on this. > multi-label classification support > -- > > Key: SPARK-20418 > URL: https://issues.apache.org/jira/browse/SPARK-20418 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > > Add a multi-label LabeledPoint and a binary relevance based classifier > wrapper. > http://scikit-learn.org/dev/modules/generated/sklearn.multioutput.MultiOutputClassifier.html#sklearn.multioutput.MultiOutputClassifier > ``` > X = [ [ 0, 1, 0], [1, 0, 1]] > y = [[1], [1, 2]] > df = spark.createDataFrame(zip(X, y), ['x', 'y']) > mlp_clf = > MultiOutputClassifier(LogisticRegression).setFeature('X').setLabel('y') > mlp_clf.fit(df) > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
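The binary relevance scheme referenced above trains one independent binary classifier per label. A minimal plain-Python sketch of that wrapper, using a trivial majority-vote stand-in for the base classifier (all names here are hypothetical, not a proposed Spark API):

```python
class MajorityClassifier:
    """Trivial stand-in binary classifier: always predicts the majority class."""
    def fit(self, X, y):
        self.label = int(sum(y) >= len(y) / 2.0)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class BinaryRelevanceClassifier:
    """Binary relevance: train one independent binary model per label."""
    def __init__(self, base_cls):
        self.base_cls = base_cls

    def fit(self, X, Y):
        # Y is a list of label sets, one per row (e.g. [{1}, {1, 2}]).
        labels = sorted({label for row in Y for label in row})
        self.models = {
            label: self.base_cls().fit(X, [1 if label in row else 0 for row in Y])
            for label in labels
        }
        return self

    def predict(self, X):
        # A row gets every label whose binary model fires for it.
        votes = {label: model.predict(X) for label, model in self.models.items()}
        return [{label for label in votes if votes[label][i]}
                for i in range(len(X))]


X = [[0, 1, 0], [1, 0, 1], [0, 0, 1]]
Y = [{1}, {1}, {2}]
clf = BinaryRelevanceClassifier(MajorityClassifier).fit(X, Y)
preds = clf.predict(X)
```

The design choice matches the scikit-learn `MultiOutputClassifier` linked in the issue: the labels are assumed independent, which keeps the wrapper generic over any binary base estimator.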
[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102274#comment-16102274 ] Weichen Xu commented on SPARK-11215: I will take over this feature and create a PR soon. > Add multiple columns support to StringIndexer > - > > Key: SPARK-11215 > URL: https://issues.apache.org/jira/browse/SPARK-11215 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add multiple columns support to StringIndexer, then users can transform > multiple input columns to multiple output columns simultaneously. See > discussion SPARK-8418. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
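Multi-column support amounts to fitting one independent label-to-index mapping per input column. A plain-Python sketch of the idea, assuming StringIndexer's default frequency-descending ordering (the alphabetical tie-break here is an assumption made for determinism, not necessarily Spark's):

```python
from collections import Counter

def fit_string_indexer(column):
    """Map each distinct string to an index, most-frequent string first
    (alphabetical tie-break assumed here for determinism)."""
    freq = Counter(column)
    ordered = sorted(freq, key=lambda s: (-freq[s], s))
    return {s: i for i, s in enumerate(ordered)}

def fit_multi_string_indexer(columns):
    """Multiple-column support: one independent mapping per input column."""
    return [fit_string_indexer(col) for col in columns]

cols = [["a", "b", "a"], ["x", "y", "y"]]
mappings = fit_multi_string_indexer(cols)
indexed = [[m[s] for s in col] for m, col in zip(mappings, cols)]
```

Doing the fits in one transformer (rather than one StringIndexer per column) saves repeated passes over the data, which is the motivation discussed in SPARK-8418.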
[jira] [Comment Edited] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860 ] yuhao yang edited comment on SPARK-21535 at 7/26/17 6:30 PM: - https://github.com/apache/spark/pull/18733 was (Author: yuhaoyan): https://github.com/apache/spark/pulls > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, the current implementation > consumes extra driver memory for holding all the trained models, which is not > necessary and often leads to memory exceptions for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > so that each used model can be collected by GC, avoiding the unnecessary OOM > exceptions. > E.g. when the grid search space is 12, the old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and the previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
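The proposed optimization can be sketched generically: fit one model per ParamMap, record its metric, and let the model go out of scope before fitting the next, so at most one trained model is alive in driver memory at a time. A plain-Python sketch with stand-in fit/evaluate functions (this is the pattern, not the actual Spark implementation):

```python
def fit_sequentially(fit, evaluate, param_grid):
    """Fit one model per ParamMap in sequence, keeping only the running
    best params/metric; each model becomes garbage-collectable as soon
    as the next iteration starts."""
    best_params, best_metric = None, float("-inf")
    for params in param_grid:
        model = fit(params)        # only this one model is alive right now
        metric = evaluate(model)
        if metric > best_metric:
            best_params, best_metric = params, metric
        # 'model' is dropped here; GC can reclaim it before the next fit
    return best_params, best_metric

grid = [{"reg": r} for r in (0.0, 0.1, 1.0)]
best, metric = fit_sequentially(
    fit=lambda p: {"reg": p["reg"]},      # stand-in "trained model"
    evaluate=lambda m: 1.0 - m["reg"],    # stand-in evaluation metric
    param_grid=grid,
)
```

Contrast with `models = est.fit(trainingDataset, epm)`, which materializes the whole `Array` of trained models before any evaluation happens.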
[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102048#comment-16102048 ] Weichen Xu commented on SPARK-21087: I will work on it. > CrossValidator, TrainValidationSplit should preserve all models after > fitting: Scala > > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101994#comment-16101994 ] Weichen Xu commented on SPARK-17025: Because calling Python from Scala is currently difficult and changing the related code is hard, maybe we can consider duplicating the import/export logic in a Python implementation: add a Python-implemented MLWriter for pyspark.ml.PipelineModel that traverses the list of stages and, if a stage is a Java model, calls the Java-side writer; otherwise it calls its own Python-implemented writer. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101984#comment-16101984 ] Ajay Saini edited comment on SPARK-17025 at 7/26/17 5:40 PM: - I'm currently working on a solution to this that involves building custom persistence support for Python-only pipeline stages. As of now, you cannot persist a pipeline stage in Python unless there is a Java implementation of that stage. The framework I'm working on will make it much easier to implement Python-only persistence of custom stages so that they don't need to rely on Java. This fix will be sent out later this week. was (Author: ajaysaini): I'm currently working on a solution to this that involves building custom persistence support for Python-only pipeline stages. As of now, you cannot persist a pipeline stage in Python unless there is a Java implementation of that stage. The framework I'm working on will make it much easier to implement Python-only persistence of custom stages so that they don't need to rely on Java. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101984#comment-16101984 ] Ajay Saini commented on SPARK-17025: I'm currently working on a solution to this that involves building custom persistence support for Python-only pipeline stages. As of now, you cannot persist a pipeline stage in Python unless there is a Java implementation of that stage. The framework I'm working on will make it much easier to implement Python-only persistence of custom stages so that they don't need to rely on Java. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
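A minimal plain-Python sketch of the failure mode described above (the class names here are hypothetical stand-ins, not Spark code): `PipelineModel._to_java` converts each stage via its `_to_java` hook, so a Python-only stage without that hook aborts the save with exactly this `AttributeError`:

```python
# Simplified stand-in for the JVM-backed persistence path described above.
# `JavaBackedTransformer` and `PythonOnlyTransformer` are hypothetical names;
# only the failure mode (a stage missing `_to_java`) mirrors the real traceback.

class JavaBackedTransformer:
    """Stage with a Java twin: persistence delegates to the JVM object."""
    def _to_java(self):
        return "jvm-proxy"  # stand-in for the py4j proxy object

class PythonOnlyTransformer:
    """Custom stage written purely in Python -- defines no _to_java."""

def pipeline_to_java(stages):
    # PipelineModel._to_java walks the stages and converts each one; any
    # Python-only stage raises AttributeError, aborting the save.
    return [stage._to_java() for stage in stages]

print(pipeline_to_java([JavaBackedTransformer()]))
try:
    pipeline_to_java([JavaBackedTransformer(), PythonOnlyTransformer()])
except AttributeError as exc:
    print(exc)
```

This is why the monkey-patching approach linked in the report works for built-in stages: each built-in Python wrapper is patched to expose a `_to_java` that rebuilds the corresponding JVM object.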
[jira] [Resolved] (SPARK-21485) API Documentation for Spark SQL functions
[ https://issues.apache.org/jira/browse/SPARK-21485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-21485. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > API Documentation for Spark SQL functions > - > > Key: SPARK-21485 > URL: https://issues.apache.org/jira/browse/SPARK-21485 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.3.0 > > > It looks we can generate the documentation from {{ExpressionDescription}} and > {{ExpressionInfo}} for Spark's SQL function documentation. > I had some time to play with this so I just made a rough version - > https://spark-test.github.io/sparksqldoc/ > Codes I used are as below : > In {{pyspark}} shell: > {code} > from collections import namedtuple > ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended") > jinfos = > spark.sparkContext._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctions() > infos = [] > for jinfo in jinfos: > name = jinfo.getName() > usage = jinfo.getUsage() > usage = usage.replace("_FUNC_", name) if usage is not None else usage > extended = jinfo.getExtended() > extended = extended.replace("_FUNC_", name) if extended is not None else > extended > infos.append(ExpressionInfo( > className=jinfo.getClassName(), > usage=usage, > name=name, > extended=extended)) > with open("index.md", 'w') as mdfile: > strip = lambda s: "\n".join(map(lambda u: u.strip(), s.split("\n"))) > for info in sorted(infos, key=lambda i: i.name): > mdfile.write("### %s\n\n" % info.name) > if info.usage is not None: > mdfile.write("%s\n\n" % strip(info.usage)) > if info.extended is not None: > mdfile.write("```%s```\n\n" % strip(info.extended)) > {code} > This change had to be made first before running the codes above: > {code:none} > +++ > b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala > @@ 
-17,9 +17,15 @@ > package org.apache.spark.sql.api.python > +import org.apache.spark.sql.catalyst.analysis.FunctionRegistry > +import org.apache.spark.sql.catalyst.expressions.ExpressionInfo > import org.apache.spark.sql.catalyst.parser.CatalystSqlParser > import org.apache.spark.sql.types.DataType > private[sql] object PythonSQLUtils { >def parseDataType(typeText: String): DataType = > CatalystSqlParser.parseDataType(typeText) > + > + def listBuiltinFunctions(): Array[ExpressionInfo] = { > +FunctionRegistry.functionSet.flatMap(f => > FunctionRegistry.builtin.lookupFunction(f)).toArray > + } > } > {code} > And then, I ran this: > {code} > mkdir docs > echo "site_name: Spark SQL 2.3.0" >> mkdocs.yml > echo "theme: readthedocs" >> mkdocs.yml > mv index.md docs/index.md > mkdocs serve > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101900#comment-16101900 ] Shivaram Venkataraman commented on SPARK-6809: -- Yeah, I don't think this JIRA is applicable any more. We can close this. > Make numPartitions optional in pairRDD APIs > --- > > Key: SPARK-6809 > URL: https://issues.apache.org/jira/browse/SPARK-6809 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6809. -- Resolution: Not A Problem > Make numPartitions optional in pairRDD APIs > --- > > Key: SPARK-6809 > URL: https://issues.apache.org/jira/browse/SPARK-6809 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21540) add spark.sql.functions.map_keys and spark.sql.functions.map_values
yu peng created SPARK-21540: --- Summary: add spark.sql.functions.map_keys and spark.sql.functions.map_values Key: SPARK-21540 URL: https://issues.apache.org/jira/browse/SPARK-21540 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: yu peng For DataFrame/SQL we support MapType, but unlike ArrayType, we cannot explode to unpack it, and getItem is just not powerful enough for manipulating MapTypes. df.select(map_keys('$map')) and df.select(map_values('$map')) would be really useful. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
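A plain-Python sketch of the row-wise semantics the proposed functions would have over a MapType column (hypothetical illustration only; the real functions would live in `spark.sql.functions` and operate on Columns, not Python lists):

```python
# Toy model of a MapType column: one Python dict per row. The proposed
# map_keys / map_values would return, per row, the map's keys or values
# as an array. Sorting here only makes the toy output deterministic.

def map_keys(column):
    return [sorted(m.keys()) for m in column]

def map_values(column):
    return [sorted(m.values()) for m in column]

map_col = [{"a": 1, "b": 2}, {"c": 3}]
print(map_keys(map_col))    # [['a', 'b'], ['c']]
print(map_values(map_col))  # [[1, 2], [3]]
```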
[jira] [Updated] (SPARK-21245) Resolve code duplication for classification/regression summarizers
[ https://issues.apache.org/jira/browse/SPARK-21245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-21245: - Labels: starter (was: ) Priority: Minor (was: Major) > Resolve code duplication for classification/regression summarizers > -- > > Key: SPARK-21245 > URL: https://issues.apache.org/jira/browse/SPARK-21245 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.1 >Reporter: Seth Hendrickson >Priority: Minor > Labels: starter > > In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary > information about training data using {{MultivariateOnlineSummarizer}} and > {{MulticlassSummarizer}}. We have the same code appearing in several places > (and including test suites). We can eliminate this by creating a common > implementation somewhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101870#comment-16101870 ] yuhao yang commented on SPARK-21535: The basic idea is that we should release the driver memory as soon as a trained model is evaluated. I don't think there's any conflict, but let me know if there is one and I'll revert the JIRA. I'm not a big fan of the Parallel CV idea. Personally I cannot see how it improves the overall performance or ease of use, but maybe I have just never met the appropriate scenarios. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, the current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exceptions for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation > so that a used model can be collected by GC, avoiding the unnecessary OOM > exceptions. > E.g. when the grid search space is 12, the old implementation needs to hold all 12 > trained models in driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and the previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
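The proposal above can be sketched in plain Python (the `fit` and `evaluate` functions here are hypothetical stand-ins for the estimator and evaluator, not Spark APIs): instead of materializing the full list of trained models, fit lazily and keep only the best one, so every other model becomes garbage-collectable as soon as it is evaluated:

```python
# Stand-ins for one training run and its evaluation; the toy metric peaks
# at params == 3 so the selection has a deterministic answer.
def fit(params):
    return {"params": params}

def evaluate(model):
    return -abs(model["params"] - 3)

def select_best(param_grid):
    # Sequential grid search holding at most one non-best model alive at a
    # time -- the memory behavior the JIRA proposes for CrossValidator.
    best_model, best_metric = None, float("-inf")
    for params in param_grid:
        model = fit(params)
        metric = evaluate(model)
        if metric > best_metric:
            best_model, best_metric = model, metric
        # `model` is unreferenced past this point unless it is the best,
        # so the previous candidate can be reclaimed by GC.
    return best_model

print(select_best(range(12)))
```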
[jira] [Resolved] (SPARK-12957) Derive and propagate data constrains in logical plan
[ https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12957. - Resolution: Fixed Fix Version/s: 2.0.0 > Derive and propagate data constrains in logical plan > - > > Key: SPARK-12957 > URL: https://issues.apache.org/jira/browse/SPARK-12957 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > Attachments: ConstraintPropagationinSparkSQL.pdf > > > Based on the semantic of a query plan, we can derive data constrains (e.g. if > a filter defines {{a > 10}}, we know that the output data of this filter > satisfy the constrain of {{a > 10}} and {{a is not null}}). We should build a > framework to derive and propagate constrains in the logical plan, which can > help us to build more advanced optimizations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
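The derivation step described above can be illustrated with a toy (this is a hypothetical sketch, not Catalyst code): from a comparison predicate such as {{a > 10}} we can additionally derive {{a IS NOT NULL}}, because a null attribute never satisfies a comparison.

```python
# Toy constraint derivation over string predicates of the form "<attr> <op> <lit>".
# Real constraint propagation works on expression trees, not strings.

def derive_constraints(predicate: str) -> set:
    attr = predicate.split()[0]  # e.g. "a" from "a > 10"
    return {predicate, f"{attr} IS NOT NULL"}

print(derive_constraints("a > 10"))
```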
[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API
[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21538: Affects Version/s: (was: 3.0.0) 2.3.0 > Attribute resolution inconsistency in Dataset API > - > > Key: SPARK-21538 > URL: https://issues.apache.org/jira/browse/SPARK-21538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu > > {code} > spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works > spark.range(1).withColumnRenamed("id", "x").sort($"id") // works > spark.range(1).withColumnRenamed("id", "x").sort('id) // works > spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: > org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among > (x); > ... > {code} > It looks like the Dataset API functions taking {{String}} use the basic > resolver that only look at the columns at that level, whereas all the other > means of expressing an attribute are lazily resolved during the analyzer. > The reason why the first 3 calls work is explained in the docs for {{object > ResolveMissingReferences}}: > {code} > /** >* In many dialects of SQL it is valid to sort by attributes that are not > present in the SELECT >* clause. This rule detects such queries and adds the required attributes > to the original >* projection, so that they will be available during sorting. Another > projection is added to >* remove these attributes after sorting. >* >* The HAVING clause could also used a grouping columns that is not > presented in the SELECT. >*/ > {code} > For consistency, it would be good to use the same attribute resolution > mechanism everywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API
[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21538: Issue Type: Improvement (was: Story) > Attribute resolution inconsistency in Dataset API > - > > Key: SPARK-21538 > URL: https://issues.apache.org/jira/browse/SPARK-21538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu > > {code} > spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works > spark.range(1).withColumnRenamed("id", "x").sort($"id") // works > spark.range(1).withColumnRenamed("id", "x").sort('id) // works > spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: > org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among > (x); > ... > {code} > It looks like the Dataset API functions taking {{String}} use the basic > resolver that only look at the columns at that level, whereas all the other > means of expressing an attribute are lazily resolved during the analyzer. > The reason why the first 3 calls work is explained in the docs for {{object > ResolveMissingReferences}}: > {code} > /** >* In many dialects of SQL it is valid to sort by attributes that are not > present in the SELECT >* clause. This rule detects such queries and adds the required attributes > to the original >* projection, so that they will be available during sorting. Another > projection is added to >* remove these attributes after sorting. >* >* The HAVING clause could also used a grouping columns that is not > presented in the SELECT. >*/ > {code} > For consistency, it would be good to use the same attribute resolution > mechanism everywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101686#comment-16101686 ] zhoukang commented on SPARK-21539: -- I am working on this. > Job should not be aborted when dynamic allocation is enabled or > spark.executor.instances larger then current allocated number by yarn > - > > Key: SPARK-21539 > URL: https://issues.apache.org/jira/browse/SPARK-21539 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.1.0, 2.2.0 >Reporter: zhoukang > > For spark on yarn. > Right now, when TaskSet can not run on any node or host.Which means > blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted. > However, if dynamic allocation is enabled, we should wait for yarn to > allocate new nodemanager in order to execute job successfully. > How to reproduce? > 1、Set up a yarn cluster with 5 nodes.And assign a node1 with much larger cpu > core and memory,which can let yarn launch container on this node even it is > blacklisted by TaskScheduler. > 2、modify BlockManager#registerWithExternalShuffleServer > {code:java} > logInfo("Registering executor with local external shuffle service.") > val shuffleConfig = new ExecutorShuffleInfo( > diskBlockManager.localDirs.map(_.toString), > diskBlockManager.subDirsPerLocalDir, > shuffleManager.getClass.getName) > val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) > val SLEEP_TIME_SECS = 5 > for (i <- 1 to MAX_ATTEMPTS) { > try { > {color:red}if (shuffleId.host.equals("node1's address")) { > throw new Exception > }{color} > // Synchronous and will throw an exception if we cannot connect. 
> > shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( > shuffleServerId.host, shuffleServerId.port, > shuffleServerId.executorId, shuffleConfig) > return > } catch { > case e: Exception if i < MAX_ATTEMPTS => > logError(s"Failed to connect to external shuffle server, will retry > ${MAX_ATTEMPTS - i}" > + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) > Thread.sleep(SLEEP_TIME_SECS * 1000) > case NonFatal(e) => > throw new SparkException("Unable to register with external shuffle > server due to : " + > e.getMessage, e) > } > } > {code} > add logic in red. > 3、set shuffle service enable as true and open shuffle service for yarn. > Then yarn will always launch executor on node1 but failed since shuffle > service can not register success. > Then job will be aborted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21539: - Description: For spark on yarn. Right now, when TaskSet can not run on any node or host.Which means blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted. However, if dynamic allocation is enabled, we should wait for yarn to allocate new nodemanager in order to execute job successfully. How to reproduce? 1、Set up a yarn cluster with 5 nodes.And assign a node1 with much larger cpu core and memory,which can let yarn launch container on this node even it is blacklisted by TaskScheduler. 2、modify BlockManager#registerWithExternalShuffleServer {code:java} logInfo("Registering executor with local external shuffle service.") val shuffleConfig = new ExecutorShuffleInfo( diskBlockManager.localDirs.map(_.toString), diskBlockManager.subDirsPerLocalDir, shuffleManager.getClass.getName) val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) val SLEEP_TIME_SECS = 5 for (i <- 1 to MAX_ATTEMPTS) { try { {color:red}if (shuffleId.host.equals("node1's address")) { throw new Exception }{color} // Synchronous and will throw an exception if we cannot connect. shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( shuffleServerId.host, shuffleServerId.port, shuffleServerId.executorId, shuffleConfig) return } catch { case e: Exception if i < MAX_ATTEMPTS => logError(s"Failed to connect to external shuffle server, will retry ${MAX_ATTEMPTS - i}" + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) Thread.sleep(SLEEP_TIME_SECS * 1000) case NonFatal(e) => throw new SparkException("Unable to register with external shuffle server due to : " + e.getMessage, e) } } {code} add logic in red. 3、set shuffle service enable as true and open shuffle service for yarn. Then yarn will always launch executor on node1 but failed since shuffle service can not register success. 
Then job will be aborted. was: For spark on yarn. Right now, when TaskSet can not run on any node or host.Which means blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted. However, if dynamic allocation is enabled, we should wait for yarn to allocate new nodemanager in order to execute job successfully.And we should report this information to yarn in case of assign same node which blacklisted by TaskScheduler. How to reproduce? 1、Set up a yarn cluster with 5 nodes.And assign a node1 with much larger cpu core and memory,which can let yarn launch container on this node even it is blacklisted by TaskScheduler. 2、modify BlockManager#registerWithExternalShuffleServer {code:java} logInfo("Registering executor with local external shuffle service.") val shuffleConfig = new ExecutorShuffleInfo( diskBlockManager.localDirs.map(_.toString), diskBlockManager.subDirsPerLocalDir, shuffleManager.getClass.getName) val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) val SLEEP_TIME_SECS = 5 for (i <- 1 to MAX_ATTEMPTS) { try { {color:red}if (shuffleId.host.equals("node1's address")) { throw new Exception }{color} // Synchronous and will throw an exception if we cannot connect. shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( shuffleServerId.host, shuffleServerId.port, shuffleServerId.executorId, shuffleConfig) return } catch { case e: Exception if i < MAX_ATTEMPTS => logError(s"Failed to connect to external shuffle server, will retry ${MAX_ATTEMPTS - i}" + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) Thread.sleep(SLEEP_TIME_SECS * 1000) case NonFatal(e) => throw new SparkException("Unable to register with external shuffle server due to : " + e.getMessage, e) } } {code} add logic in red. 3、set shuffle service enable as true and open shuffle service for yarn. Then yarn will always launch executor on node1 but failed since shuffle service can not register success. Then job will be aborted. 
> Job should not be aborted when dynamic allocation is enabled or > spark.executor.instances larger then current allocated number by yarn > - > > Key: SPARK-21539 > URL: https://issues.apache.org/jira/browse/SPARK-21539 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.1.0, 2.2.0 >Reporter: zhoukang > > For spark on
[jira] [Created] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn
zhoukang created SPARK-21539: Summary: Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn Key: SPARK-21539 URL: https://issues.apache.org/jira/browse/SPARK-21539 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0, 2.1.0, 1.6.1 Reporter: zhoukang For spark on yarn. Right now, when TaskSet can not run on any node or host.Which means blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted. However, if dynamic allocation is enabled, we should wait for yarn to allocate new nodemanager in order to execute job successfully.And we should report this information to yarn in case of assign same node which blacklisted by TaskScheduler. How to reproduce? 1、Set up a yarn cluster with 5 nodes.And assign a node1 with much larger cpu core and memory,which can let yarn launch container on this node even it is blacklisted by TaskScheduler. 2、modify BlockManager#registerWithExternalShuffleServer {code:java} logInfo("Registering executor with local external shuffle service.") val shuffleConfig = new ExecutorShuffleInfo( diskBlockManager.localDirs.map(_.toString), diskBlockManager.subDirsPerLocalDir, shuffleManager.getClass.getName) val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) val SLEEP_TIME_SECS = 5 for (i <- 1 to MAX_ATTEMPTS) { try { {color:red}if (shuffleId.host.equals("node1's address")) { throw new Exception }{color} // Synchronous and will throw an exception if we cannot connect. 
shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( shuffleServerId.host, shuffleServerId.port, shuffleServerId.executorId, shuffleConfig) return } catch { case e: Exception if i < MAX_ATTEMPTS => logError(s"Failed to connect to external shuffle server, will retry ${MAX_ATTEMPTS - i}" + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) Thread.sleep(SLEEP_TIME_SECS * 1000) case NonFatal(e) => throw new SparkException("Unable to register with external shuffle server due to : " + e.getMessage, e) } } {code} add logic in red. 3、set shuffle service enable as true and open shuffle service for yarn. Then yarn will always launch executor on node1 but failed since shuffle service can not register success. Then job will be aborted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597 ] Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:10 PM: --- Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit 2 job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore was (Author: quincy.tw): Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
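For reference, the two configuration values from the workaround above, reconstructed as property blocks (the XML tags were flattened in this digest; the names and values are copied from the comment, and the file placement is as stated there):

```xml
<!-- yarn-site.xml: enable the FairScheduler so jobs can be submitted to separate queues -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<!-- hive-site.xml (in the Spark config): switch the Derby metastore to in-memory mode -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:memory:/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
```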
[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597 ] Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:09 PM: --- Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore was (Author: quincy.tw): Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597 ] Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:08 PM: --- Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore was (Author: quincy.tw): Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597 ] Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:07 PM: --- Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modifying the connection string in Spark config hive-site.xml javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore was (Author: quincy.tw): Hi, In case that someone who has the same issue and comes to this page : here is the workaround I use to resolve the problem : 1. Configure Spark with Yarn client mode. 2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can spark-submit job to different queue name with --queue yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 3. Switch derby to memory mode by modify the connection string in javax.jdo.option.ConnectionURL jdbc:derby:memory:/metastore_db;create=true JDBC connect string for a JDBC metastore > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
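For reference, step 3 of the workaround above corresponds to a hive-site.xml property block along these lines (a sketch in the standard Hadoop/Hive property format; the exact config file location depends on your distribution):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:memory:/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
```

Note that an in-memory Derby metastore is discarded when the JVM exits, so this trades persistence for avoiding the XSDB6 lock conflict.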
[jira] [Resolved] (SPARK-20988) Convert logistic regression to new aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20988. Resolution: Fixed Fix Version/s: 2.3.0 > Convert logistic regression to new aggregator framework > --- > > Key: SPARK-20988 > URL: https://issues.apache.org/jira/browse/SPARK-20988 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > Use the hierarchy from SPARK-19762 for logistic regression optimization -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20988) Convert logistic regression to new aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20988: -- Assignee: Seth Hendrickson > Convert logistic regression to new aggregator framework > --- > > Key: SPARK-20988 > URL: https://issues.apache.org/jira/browse/SPARK-20988 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > Use the hierarchy from SPARK-19762 for logistic regression optimization -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21538) Attribute resolution inconsistency in Dataset API
Adrian Ionescu created SPARK-21538: -- Summary: Attribute resolution inconsistency in Dataset API Key: SPARK-21538 URL: https://issues.apache.org/jira/browse/SPARK-21538 Project: Spark Issue Type: Story Components: SQL Affects Versions: 3.0.0 Reporter: Adrian Ionescu {code} spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works spark.range(1).withColumnRenamed("id", "x").sort($"id") // works spark.range(1).withColumnRenamed("id", "x").sort('id) // works spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x); ... {code} It looks like the Dataset API functions taking {{String}} use the basic resolver, which only looks at the columns at that level, whereas all the other ways of expressing an attribute are lazily resolved by the analyzer. The reason the first 3 calls work is explained in the docs for {{object ResolveMissingReferences}}: {code} /** * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT * clause. This rule detects such queries and adds the required attributes to the original * projection, so that they will be available during sorting. Another projection is added to * remove these attributes after sorting. * * The HAVING clause could also used a grouping columns that is not presented in the SELECT. */ {code} For consistency, it would be good to use the same attribute resolution mechanism everywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
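The eager-vs-lazy difference described above can be modeled in a few lines of plain Python (a toy sketch, not Spark internals; the `Plan` class and both method names are illustrative only):

```python
# Toy model of attribute resolution: string lookups check only the current
# plan's schema, while Column-style references are resolved later, when the
# analyzer can still pull missing attributes in from child plans (as
# ResolveMissingReferences does for sort).
class Plan:
    def __init__(self, columns, child=None):
        self.columns = columns
        self.child = child

    def resolve_eager(self, name):
        # Mimics Dataset.sort("id"): only this level's columns are visible.
        if name not in self.columns:
            raise KeyError('Cannot resolve column name "%s" among (%s)'
                           % (name, ",".join(self.columns)))
        return name

    def resolve_lazy(self, name):
        # Mimics sort(col("id")): walk child plans to find the attribute.
        plan = self
        while plan is not None:
            if name in plan.columns:
                return name
            plan = plan.child
        raise KeyError(name)

# withColumnRenamed("id", "x") produces a plan over "x" whose child still has "id".
renamed = Plan(["x"], child=Plan(["id"]))
print(renamed.resolve_lazy("id"))    # resolved via the child plan
try:
    renamed.resolve_eager("id")      # mirrors the AnalysisException
except KeyError as e:
    print("eager lookup fails:", e)
```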
[jira] [Created] (SPARK-21537) toPandas() should handle nested columns (as a Pandas MultiIndex)
Eric O. LEBIGOT (EOL) created SPARK-21537: - Summary: toPandas() should handle nested columns (as a Pandas MultiIndex) Key: SPARK-21537 URL: https://issues.apache.org/jira/browse/SPARK-21537 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.2.0 Reporter: Eric O. LEBIGOT (EOL) The conversion of a *PySpark dataframe with nested columns* to Pandas (with `toPandas()`) does not convert nested columns into their Pandas equivalent, i.e. *columns indexed by a [MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*. For example, a dataframe with the following structure: {code:java} >>> df.printSchema() root |-- device_ID: string (nullable = true) |-- time_origin_UTC: timestamp (nullable = true) |-- duration_s: integer (nullable = true) |-- session_time_UTC: timestamp (nullable = true) |-- probes_by_AP: struct (nullable = true) ||-- aa:bb:cc:dd:ee:ff: array (nullable = true) |||-- element: struct (containsNull = true) ||||-- delay_s: float (nullable = true) ||||-- RSSI: short (nullable = true) |-- max_RSSI_info_by_AP: struct (nullable = true) ||-- aa:bb:cc:dd:ee:ff: struct (nullable = true) |||-- delay_s: float (nullable = true) |||-- RSSI: short (nullable = true) {code} yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ nested inside Pandas (through a MultiIndex): {code} >>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]. # >>> Should work! (…) KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI') >>> sessions_in_period["max_RSSI_info_by_AP"].iloc[0] Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6)) >>> type(_) # PySpark type, instead of Pandas! pyspark.sql.types.Row {code} It would be much more convenient if the Spark dataframe did the conversion to Pandas more thoroughly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
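The requested conversion can be sketched without Spark or pandas: nested struct fields become tuple keys, which is exactly the form a pandas MultiIndex accepts (plain-Python sketch; `flatten` is a hypothetical helper, not an existing Spark or pandas API):

```python
# Flatten nested records into a dict keyed by field-path tuples, the shape
# pandas uses for MultiIndex columns: pd.DataFrame([flat]) would then give
# df[("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI")] directly.
def flatten(record, prefix=()):
    """Turn nested dicts into a flat dict keyed by field-path tuples."""
    flat = {}
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# One row shaped like the max_RSSI_info_by_AP struct from the schema above:
row = {"max_RSSI_info_by_AP": {"aa:bb:cc:dd:ee:ff": {"delay_s": 0.0, "RSSI": 6}}}
flat = flatten(row)
print(flat[("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI")])  # 6
```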
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101476#comment-16101476 ] Peng Meng commented on SPARK-21476: --- Hi [~sagraw], could you please test copy-pasting the transform method from ProbabilisticClassificationModel into RandomForestClassificationModel and adding broadcasting inside transform? Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, I found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel, and the only concrete > definition that RandomForestClassificationModel provides, and which is > actually used in transform, is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101459#comment-16101459 ] Hyukjin Kwon commented on SPARK-6809: - Hi [~davies], I can't find the APIs of pairRDD.R file in the R API documentation. Is this JIRA obsolete in favour of SPARK-7230? > Make numPartitions optional in pairRDD APIs > --- > > Key: SPARK-6809 > URL: https://issues.apache.org/jira/browse/SPARK-6809 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101452#comment-16101452 ] Nick Pentreath commented on SPARK-21535: Parallel CV is in progress: https://github.com/apache/spark/pull/16774. How will this work with that? > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, the current implementation > consumes extra driver memory by holding all trained models, which is not > necessary and often leads to memory exceptions for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation > so that each used model can be collected by GC, avoiding the unnecessary OOM > exceptions. > E.g., when the grid search space is 12, the old implementation needs to hold all 12 > trained models in driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and the previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
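The proposal in the description can be sketched in plain Python (illustrative only; `fit` and `evaluate` stand in for the real estimator and evaluator, and the param maps are hypothetical):

```python
# Train one model per ParamMap at a time, keep only its metric, and drop the
# reference so the model becomes garbage-collectible, instead of holding all
# |grid| trained models in driver memory simultaneously.
def fit_and_evaluate(fit, evaluate, param_maps):
    metrics = []
    for pm in param_maps:
        model = fit(pm)            # train a single model
        metrics.append(evaluate(model))
        del model                  # last reference gone: GC can reclaim it
    return metrics

# Dummy stand-ins for an estimator and evaluator over a 3-point grid:
grid = [{"maxDepth": d} for d in (2, 4, 8)]
metrics = fit_and_evaluate(lambda pm: {"depth": pm["maxDepth"]},
                           lambda m: 1.0 / m["depth"],
                           grid)
print(metrics)  # [0.5, 0.25, 0.125]
```

The open question raised in the comment is real: with parallel CV, several models are in flight at once, so peak memory becomes the parallelism level rather than 1.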
[jira] [Resolved] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files
[ https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21524. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18728 [https://github.com/apache/spark/pull/18728] > ValidatorParamsSuiteHelpers generates wrong temp files > -- > > Key: SPARK-21524 > URL: https://issues.apache.org/jira/browse/SPARK-21524 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > Fix For: 2.3.0 > > > ValidatorParamsSuiteHelpers.testFileMove() is generating temp dir in the > wrong place and does not delete them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files
[ https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21524: - Assignee: yuhao yang > ValidatorParamsSuiteHelpers generates wrong temp files > -- > > Key: SPARK-21524 > URL: https://issues.apache.org/jira/browse/SPARK-21524 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.3.0 > > > ValidatorParamsSuiteHelpers.testFileMove() is generating temp dir in the > wrong place and does not delete them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-11046. -- Resolution: Not A Problem I recently touched some code around here - SPARK-20493. I think we have lost track of the actual issue and the benefit of the fix here. I am resolving this as it looks obsolete anyway. Please reopen it if I misunderstood. > Pass schema from R to JVM using JSON format > --- > > Key: SPARK-11046 > URL: https://issues.apache.org/jira/browse/SPARK-11046 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Sun Rui >Priority: Minor > > Currently, SparkR passes a DataFrame schema from R to the JVM backend using a > regular expression. However, Spark now supports schemas in JSON format, > so SparkR should be enhanced to pass the schema as JSON. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101443#comment-16101443 ] Saurabh Agrawal commented on SPARK-21476: - [~peng.m...@intel.com] My streaming application is suffering from increased latency in the stage where I use this model for prediction. In the spark UI for this stage, I could see that task execution time is alright but each of the tasks takes a bigger chunk of the time in deserialization. When I comment out the line where I call model.transform and use dummy values there instead, run the same application in the same environment, the task deserialization time reduces and overall the stage executes significantly faster. I also tried xgboost model with sparkml compatible third party libraries ([https://github.com/komiya-atsushi/xgboost-predictor-java/blob/master/xgboost-predictor-spark/src/main/scala/biz/k11i/xgboost/spark/model/XGBoostBinaryClassificationModel.scala]). Just like RandomForestClassificationModel, this model subclasses ProbabilisticClassificationModel hence it uses the same transform method. I noticed the same kind of task deserialization and stage execution times as with RF model. I cloned this repo and copy pasted the transform method from ProbabilisticClassifier into XGBoostBinaryClassificationModel, only this time I added broadcasting inside transform. This change brought down the execution time significantly. [~srowen] Adding broadcast in ProbabilisticClassifier transform implementation can fix this, i.e. broadcasting the model instance and calling predictRaw, raw2Probability and other row level methods on this broadcast value. 
> RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
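The deserialization cost described in the comment above comes from serializing the model into every task closure; a broadcast-style handle avoids it. A plain-Python illustration of the size difference (no Spark involved; `REGISTRY` is a made-up stand-in for the per-executor broadcast cache):

```python
import pickle

class BigModel:
    """Stand-in for a trained RandomForest model with a large payload."""
    def __init__(self):
        self.weights = list(range(100000))
    def predict(self, x):
        return x

class TaskWithModel:
    """Closure-style task: the whole model ships with every task."""
    def __init__(self, model):
        self.model = model
    def __call__(self, x):
        return self.model.predict(x)

REGISTRY = {}  # hypothetical per-executor cache, as a broadcast variable provides

class TaskWithBroadcast:
    """Broadcast-style task: only a small handle ships; the model is looked up once per executor."""
    def __init__(self, key):
        self.key = key
    def __call__(self, x):
        return REGISTRY[self.key].predict(x)

model = BigModel()
REGISTRY["rf"] = model
heavy = len(pickle.dumps(TaskWithModel(model)))     # model inside the task payload
light = len(pickle.dumps(TaskWithBroadcast("rf")))  # only the handle in the payload
print(heavy > light)  # True: per-task payload shrinks dramatically
```

This mirrors the reporter's measurement: the fix is not in predict logic but in what each task must deserialize.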
[jira] [Commented] (SPARK-21536) Remove the workaroud to allow dots in field names in R's createDataFame
[ https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101418#comment-16101418 ] Hyukjin Kwon commented on SPARK-21536: -- BTW, my try was - https://github.com/apache/spark/pull/18737. I think the fix can be separate (like I did in that PR) or together. I won't mind resolving this as a duplicate if anyone feels strongly so. > Remove the workaroud to allow dots in field names in R's createDataFame > --- > > Key: SPARK-21536 > URL: https://issues.apache.org/jira/browse/SPARK-21536 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > We are currently converting dots to underscore in some cases as below in > SparkR > {code} > > createDataFrame(list(list(1)), "a.a") > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > {code} > > createDataFrame(list(list(a.a = 1))) > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > This looks introduced in the first place due to SPARK-2775 but now it is > fixed in SPARK-6898. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21536) Remove the workaroud to allow dots in field names in R's createDataFame
[ https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101410#comment-16101410 ] Hyukjin Kwon commented on SPARK-21536: -- I just realised what I found while fixing this is related with SPARK-12191. > Remove the workaroud to allow dots in field names in R's createDataFame > --- > > Key: SPARK-21536 > URL: https://issues.apache.org/jira/browse/SPARK-21536 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > We are currently converting dots to underscore in some cases as below in > SparkR > {code} > > createDataFrame(list(list(1)), "a.a") > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > {code} > > createDataFrame(list(list(a.a = 1))) > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > This looks introduced in the first place due to SPARK-2775 but now it is > fixed in SPARK-6898. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099477#comment-16099477 ] HanCheol Cho edited comment on SPARK-16784 at 7/26/17 8:27 AM: --- Hi, I used the following options, which allow both the driver and executors to use a custom log4j.properties: {code} spark-submit \ --driver-java-options "-Dlog4j.configuration=file:///absolute/path/to/conf/log4j.properties" \ --files conf/log4j.properties \ --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \ ... {code} I used a local log4j.properties file for the driver and the file in the distributed cache (via the --files option) for the executors. As shown above, the paths for the driver and executors are different. I hope this is useful to others. Edited 2017-07-26: log4j.configuration needs a path in URI format, so the file:// prefix is necessary for the driver option. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. 
It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
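For reference, a minimal log4j.properties of the kind passed above might look like the following. This is an illustrative sketch, not a file prescribed by the ticket; the appender name and the package-level logger line are arbitrary choices:

```properties
# Route everything at WARN and above to stderr; "console" is our appender name.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Illustrative: quiet one noisy package further than the root level.
log4j.logger.org.apache.spark.storage=ERROR
```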
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101298#comment-16101298 ]

Sean Owen commented on SPARK-21476:
---
[~sagraw] first someone would have to propose a fix

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction with pipelines that use RandomForestClassificationModel. While digging into the source, I found that the transform method in RandomForestClassificationModel binds to its parent, ProbabilisticClassificationModel, and the only concrete definition that RandomForestClassificationModel provides, and which is actually used in transform, is that of predictRaw. Broadcasting is not used in predictRaw.
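The deserialization cost being reported can be illustrated without Spark: when a task function captures a large model directly, every serialized task embeds a full copy of it, whereas a broadcast variable ships only a small handle and lets executors fetch the model once. A stdlib-only Python sketch — the model, the predict function, and the "handle" are all hypothetical stand-ins, not Spark APIs:

```python
import pickle
from functools import partial

def predict(model, x):
    """Toy stand-in for predictRaw; the actual prediction logic is irrelevant here."""
    return len(model) + x

# Hypothetical stand-in for a large model (think: a 120-tree forest).
big_model = [list(range(1000)) for _ in range(100)]

# Without broadcast: the model rides along inside every serialized task.
task_with_model = partial(predict, big_model)

# With broadcast: each task carries only a small handle; the model itself
# would be fetched once per executor from the broadcast machinery.
task_with_handle = partial(predict, "broadcast-id-0")

size_without = len(pickle.dumps(task_with_model))
size_with = len(pickle.dumps(task_with_handle))
# size_without is orders of magnitude larger than size_with, and that cost
# is paid on every task when the model is captured in the closure.
```

This is why broadcasting a large model before transform reduces per-task deserialization latency: the payload moves out of the task closure.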
[jira] [Resolved] (SPARK-21412) Reset BufferHolder while initializing an UnsafeRowWriter
[ https://issues.apache.org/jira/browse/SPARK-21412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21412.
---
Resolution: Not A Problem

> Reset BufferHolder while initializing an UnsafeRowWriter
> --
>
> Key: SPARK-21412
> URL: https://issues.apache.org/jira/browse/SPARK-21412
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Chenzhao Guo
>
> UnsafeRowWriter's constructor should call BufferHolder.reset so that the writer works out of the box.
> When writing an UnsafeRow with UnsafeRowWriter, developers must manually call BufferHolder.reset so that the BufferHolder's cursor (which indicates where the variable-length portion is written) is positioned correctly before writing variable-length fields like UTF8String, byte[], etc.
> If a developer doesn't reuse the BufferHolder, he may never notice reset and the comments in the code, and the resulting UnsafeRow will be incorrect if he also writes a variable-length UTF8String. This API design doesn't make sense. We should reset the BufferHolder so that UnsafeRowWriter works out of the box, rather than expecting users to read the code and call it manually.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
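The API-design point in the ticket can be sketched in a toy form. This is illustrative Python, not Spark's actual Java BufferHolder/UnsafeRowWriter; the point is only that calling reset in the writer's constructor means callers cannot forget it:

```python
class BufferHolder:
    """Toy buffer with a cursor marking where variable-length data goes next."""
    def __init__(self, fixed_size):
        self.fixed_size = fixed_size
        self.buffer = bytearray(fixed_size)
        self.cursor = None  # undefined until reset() is called

    def reset(self):
        # Variable-length writes must start right after the fixed-size region.
        self.cursor = self.fixed_size

class RowWriter:
    """Toy writer that resets its holder on construction, as the ticket requests."""
    def __init__(self, holder):
        self.holder = holder
        holder.reset()  # done here, so users can't forget it

    def write_var_len(self, data):
        start = self.holder.cursor
        self.holder.buffer.extend(data)
        self.holder.cursor += len(data)
        # In the real design, `start` would be recorded as an offset in the
        # fixed-size region; here we just return it.
        return start

# Works out of the box: no manual reset() needed before the first write.
writer = RowWriter(BufferHolder(fixed_size=16))
offset = writer.write_var_len(b"hello")
```

The resolution "Not A Problem" reflects that Spark treats the manual reset as part of the internal contract, but the sketch shows what "out of the box" would mean.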
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101253#comment-16101253 ]

Peng Meng commented on SPARK-21476:
---
Not every transform uses broadcast; do you have experimental data showing that broadcast is a problem here? Thanks.
[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100151#comment-16100151 ]

Saurabh Agrawal edited comment on SPARK-21476 at 7/26/17 6:56 AM:
--
[~peng.m...@intel.com] I am using it in Spark Streaming, where I give 16 cores to the application. The dataset in each batch has around 100 partitions. The model has 120 trees and is trained with max depth 15. The number of features is around 100.

was (Author: sagraw):
I am using it in Spark Streaming, where I give 16 cores to the application. The dataset in each batch has around 100 partitions. The model has 120 trees and is trained with max depth 15. The number of features is around 100.
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101241#comment-16101241 ]

Saurabh Agrawal commented on SPARK-21476:
---
[~srowen] Can a fix for this go into the next maintenance release for 2.2?
[jira] [Commented] (SPARK-21445) NotSerializableException thrown by UTF8String.IntWrapper
[ https://issues.apache.org/jira/browse/SPARK-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101224#comment-16101224 ]

jin xing commented on SPARK-21445:
---
Sorry, I reported the exception by mistake. With the change, it works well for me.

> NotSerializableException thrown by UTF8String.IntWrapper
> --
>
> Key: SPARK-21445
> URL: https://issues.apache.org/jira/browse/SPARK-21445
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Fix For: 2.2.1, 2.3.0
>
> {code}
> Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
> Serialization stack:
> - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
> - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
> - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, )
> {code}
> Not exactly sure in which specific case this exception is thrown, because I couldn't come up with a simple reproducer yet, but I will include a test in the PR.
[jira] [Created] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame
Hyukjin Kwon created SPARK-21536:

Summary: Remove the workaround to allow dots in field names in R's createDataFrame
Key: SPARK-21536
URL: https://issues.apache.org/jira/browse/SPARK-21536
Project: Spark
Issue Type: Improvement
Components: SparkR
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor

We currently convert dots to underscores in some cases in SparkR, as below:
{code}
> createDataFrame(list(list(1)), "a.a")
SparkDataFrame[a_a:double]
...
In FUN(X[[i]], ...) : Use a_a instead of a.a as column name
{code}
{code}
> createDataFrame(list(list(a.a = 1)))
SparkDataFrame[a_a:double]
...
In FUN(X[[i]], ...) : Use a_a instead of a.a as column name
{code}
This looks introduced in the first place due to SPARK-2775, but that is now fixed by SPARK-6898.
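The workaround being removed amounts to a column-name sanitization step that rewrites dots as underscores. As a rough illustration only (written in Python for brevity; SparkR's actual code is R, and this function name is made up):

```python
def sanitize_column_name(name):
    """Hypothetical sketch of the dot-to-underscore workaround being removed."""
    return name.replace(".", "_")

# With the workaround, a requested column name "a.a" becomes "a_a",
# which is what the warnings in the R session above report.
renamed = sanitize_column_name("a.a")
```

Removing the workaround would let createDataFrame keep "a.a" verbatim, since the underlying limitation (SPARK-2775) was resolved by SPARK-6898.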