[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952545#comment-16952545 ]

Sungpeo Kook commented on SPARK-29354:
--------------------------------------

[~angerszhuuu] I'm talking about the 'without hadoop' package type, e.g. spark-2.4.4-bin-without-hadoop.tgz.

> Spark has direct dependency on jline, but binaries for 'without hadoop'
> don't have a jline jar file.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29354
>                 URL: https://issues.apache.org/jira/browse/SPARK-29354
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.3.4, 2.4.4
>         Environment: From spark 2.3.x, spark 2.4.x
>            Reporter: Sungpeo Kook
>            Priority: Minor
>
> Spark has a direct dependency on jline, declared in the root pom.xml,
> but the binaries for 'without hadoop' don't include a jline jar file.
>
> spark 2.2.x has the jline jar.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
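For context, the direct dependency the report refers to is declared roughly like the following fragment in Spark's root pom.xml. This is an illustrative sketch, not a quote of the actual pom; the version is managed there and is omitted here rather than guessed.

```xml
<!-- Illustrative sketch of the jline declaration; the real root pom
     manages the version and scope, which may differ. -->
<dependency>
  <groupId>jline</groupId>
  <artifactId>jline</artifactId>
  <!-- version managed in the root pom's dependencyManagement section -->
</dependency>
```

The issue is that the declaration exists, but the 'without hadoop' binary distribution's jars/ directory does not ship the corresponding jline jar.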
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952542#comment-16952542 ]

Thomas Graves commented on SPARK-29465:
---------------------------------------

Thanks for copying me; yes, I agree this would be an improvement. So my understanding is you want to restrict to a certain port range, or just a very specific port? A specific port doesn't really make sense on YARN, where you have multiple users.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-29465
>                 URL: https://issues.apache.org/jira/browse/SPARK-29465
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Submit, YARN
>    Affects Versions: 3.0.0
>            Reporter: Vishwas Nalka
>            Priority: Major
>
> I'm trying to restrict the ports used by a spark app launched in yarn
> cluster mode. All ports (viz. driver, executor, blockmanager) can be
> specified using the respective properties except the UI port. The spark
> app is launched using Java code, and setting the property spark.ui.port
> in SparkConf doesn't seem to help. Even setting a JVM option
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, the property spark.ui.port is overridden
> and the JVM property '-Dspark.ui.port=0' is set, even though it is never
> set to 0.
>
> (Run in Spark 1.6.2) From the logs:
>
> command: LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900'
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903'
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20'
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'
>
> 19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 35167.
> 19/10/14 16:39:59 INFO SparkUI: Started SparkUI at http://10.65.170.98:35167
>
> I even tried using a spark-submit command with --conf spark.ui.port; this
> does not spawn the UI on the required port either.
>
> (Run in Spark 2.4.4)
>
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10
>
> From the logs:
>
> 19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at
> http://invrh74ace005.informatica.com:46622
>
> command: {{JAVA_HOME}}/bin/java -server -Xmx2048m
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=12340'
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p'
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340
> --executor-id  --hostname  --cores 1 --app-id
> application_1570992022035_0089 --user-class-path
> file:$PWD/__app__.jar 1>/stdout 2>/stderr
>
> Looks like the application master overrides this and sets a JVM property
> before launch, resulting in a random UI port even though spark.ui.port is
> set by the user.
>
> In these links:
> 1. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (line 214)
> 2. https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (line 75)
> I can see that the run() method in the above files sets a system property
> (UI_PORT and spark.ui.port, respectively).
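The override the reporter describes can be illustrated with a small sketch. The resolution order below is an assumption inferred from the logs above (a JVM system property on the launch command wins over the value in SparkConf), not a quote of Spark's actual code; `resolveUiPort` is a hypothetical helper.

```scala
object UiPortSketch {
  // Hypothetical resolution order inferred from the logs: a system
  // property set on the JVM command line takes precedence over the
  // value the user put in SparkConf.
  def resolveUiPort(sparkConfValue: Option[String]): Int =
    sys.props.get("spark.ui.port")   // e.g. "0" forced by the ApplicationMaster
      .orElse(sparkConfValue)        // the user's --conf spark.ui.port=12345
      .getOrElse("4040")             // Spark's default UI port
      .toInt
}

// The AM launch command sets -Dspark.ui.port=0, so the user's 12345 is
// masked and the UI binds to an ephemeral port chosen by the OS.
sys.props("spark.ui.port") = "0"
println(UiPortSketch.resolveUiPort(Some("12345")))  // 0
```

Under this reading, the fix would be for the ApplicationMaster not to force the property to 0 when the user has set one (or to honor a port range), which is what the improvement request asks for.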
[jira] [Assigned] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reassigned SPARK-29349:
-----------------------------------

    Assignee: Juliusz Sompolski

> Support FETCH_PRIOR in Thriftserver query results fetching
> ----------------------------------------------------------
>
>                 Key: SPARK-29349
>                 URL: https://issues.apache.org/jira/browse/SPARK-29349
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Juliusz Sompolski
>            Assignee: Juliusz Sompolski
>            Priority: Major
>
> Support FETCH_PRIOR fetching in Thriftserver.
[jira] [Resolved] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-29349.
---------------------------------

    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26014
[https://github.com/apache/spark/pull/26014]
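For readers unfamiliar with the feature: FETCH_PRIOR is a Thrift fetch orientation that lets a client step backwards through rows it has already fetched, as opposed to the default forward-only FETCH_NEXT. A minimal sketch of a cursor supporting both orientations over a buffered result set follows; the class and method names are hypothetical and are not the Thriftserver's actual implementation.

```scala
// Hypothetical buffered cursor; the real Thriftserver code differs,
// but the two fetch orientations behave along these lines.
class BufferedCursor[A](rows: IndexedSeq[A]) {
  private var pos = 0  // index of the next row to return going forward

  // FETCH_NEXT: return up to n rows ahead of the cursor, advancing it.
  def fetchNext(n: Int): IndexedSeq[A] = {
    val batch = rows.slice(pos, pos + n)
    pos += batch.length
    batch
  }

  // FETCH_PRIOR: return up to n rows behind the cursor, rewinding it.
  def fetchPrior(n: Int): IndexedSeq[A] = {
    val start = math.max(0, pos - n)
    val batch = rows.slice(start, pos)
    pos = start
    batch
  }
}

val c = new BufferedCursor(Vector(1, 2, 3, 4, 5))
println(c.fetchNext(3))   // Vector(1, 2, 3)
println(c.fetchPrior(2))  // Vector(2, 3): re-fetch the last two rows
```

Supporting this in the Thriftserver requires keeping fetched batches around so earlier rows can be served again, which is the substance of the change.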
[jira] [Created] (SPARK-29487) Ability to run Spark Kubernetes other than from /opt/spark
Benjamin Miao CAI created SPARK-29487:
-----------------------------------------

             Summary: Ability to run Spark Kubernetes other than from /opt/spark
                 Key: SPARK-29487
                 URL: https://issues.apache.org/jira/browse/SPARK-29487
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes, Spark Submit
    Affects Versions: 2.4.4
            Reporter: Benjamin Miao CAI

In the spark kubernetes Dockerfile, the spark binaries are copied to /opt/spark.

If we try to create our own Dockerfile without using /opt/spark, then the image will not run.

After looking at the source code, it seems that in various places the path is hard-coded to /opt/spark.

Example, in Constants.scala:

  // Spark app configs for containers
  val SPARK_CONF_VOLUME = "spark-conf-volume"
  val SPARK_CONF_DIR_INTERNAL = "/opt/spark/conf"

Is it possible to make this configurable so we can put spark somewhere other than /opt/?
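To illustrate the constraint the report describes: a custom image currently has to place the distribution at /opt/spark, because paths such as SPARK_CONF_DIR_INTERNAL are compile-time constants. A hypothetical Dockerfile sketch (the base image, source directory name, and entrypoint path are assumptions for illustration, not the project's official Dockerfile):

```dockerfile
# Hypothetical custom image: the binaries must land at /opt/spark because
# Constants.scala hard-codes paths like /opt/spark/conf.
FROM eclipse-temurin:8-jre
COPY spark-dist /opt/spark   # copying to any other prefix breaks the image
ENV SPARK_HOME=/opt/spark
ENTRYPOINT ["/opt/spark/kubernetes/dockerfiles/spark/entrypoint.sh"]
```

Making the prefix configurable (e.g. derived from SPARK_HOME rather than a constant) would let images place the distribution elsewhere.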
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952537#comment-16952537 ]

Sandeep Katta commented on SPARK-29465:
---------------------------------------

[~dongjoon] thank you for the review. I will raise a PR soon.
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952535#comment-16952535 ]

Vishwas Nalka edited comment on SPARK-29465 at 10/16/19 6:04 AM:
-----------------------------------------------------------------

Reopened the issue as an Improvement.

was (Author: vishwasn):
Reopen the issue as Improvement.
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952535#comment-16952535 ]

Vishwas Nalka commented on SPARK-29465:
---------------------------------------

Reopen the issue as Improvement.
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka updated SPARK-29465:
----------------------------------

    Affects Version/s: (was: 2.4.4)
                       3.0.0
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka updated SPARK-29465:
----------------------------------

    Issue Type: Improvement  (was: Bug)
[jira] [Reopened] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka reopened SPARK-29465:
-----------------------------------
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952526#comment-16952526 ] Dongjoon Hyun commented on SPARK-29465: --- You may reopen this issue as an `Improvement` JIRA with the affected version `3.0.0`, but this is not a bug definitely. > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.4.4 >Reporter: Vishwas Nalka >Priority: Major > > I'm trying to restrict the ports used by spark app which is launched in yarn > cluster mode. All ports (viz. driver, executor, blockmanager) could be > specified using the respective properties except the ui port. The spark app > is launched using JAVA code and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI is required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
[jira] [Created] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds
Liu, Linhong created SPARK-29486: Summary: CalendarInterval should have 3 fields: months, days and microseconds Key: SPARK-29486 URL: https://issues.apache.org/jira/browse/SPARK-29486 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Liu, Linhong The current CalendarInterval has 2 fields: months and microseconds. This proposal changes it to 3 fields: months, days and microseconds, because one logical day interval may span a different number of microseconds (daylight saving time).
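The daylight-saving argument can be checked directly on the JVM: adding one calendar day and adding a fixed 86,400 seconds diverge across a DST transition, which is why a `days` field cannot be folded into `microseconds`. The zone and date below are illustrative choices, not taken from the issue:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstDay {
    public static void main(String[] args) {
        // 2019-11-03 is the DST fall-back date in America/Los_Angeles,
        // so that calendar day is 25 hours long.
        ZonedDateTime start =
            ZonedDateTime.of(2019, 11, 3, 0, 0, 0, 0, ZoneId.of("America/Los_Angeles"));

        // A "days" interval follows the calendar; a "microseconds" interval
        // follows the clock. Across the transition they disagree by one hour.
        ZonedDateTime plusOneDay = start.plusDays(1);         // 2019-11-04T00:00
        ZonedDateTime plusFixed  = start.plusSeconds(86_400); // 2019-11-03T23:00

        System.out.println(plusOneDay.toLocalDateTime());
        System.out.println(plusFixed.toLocalDateTime());
    }
}
```

Here 25 wall-clock hours elapse over the "one day" interval, so a 2-field (months, microseconds) representation cannot express a one-day interval faithfully.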
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952516#comment-16952516 ] Vishwas Nalka edited comment on SPARK-29465 at 10/16/19 5:35 AM: - [~dongjoon], The problem is when the yarn cluster is running on a machine where not all ports can be opened. The requirement is to restrict the ports used by a Spark job launched in yarn mode. I was able to set all other ports, such as _"spark.driver.port"_ and _"spark.blockManager.port"_, except the UI port. {code:java} // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0"){code} Can't the above code be modified to check whether the UI port is already set by the user, and fall back to a random port (as the comment describes) only when it is not? Do let me know your suggestion. Thanks!
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952516#comment-16952516 ] Vishwas Nalka commented on SPARK-29465: --- [~dongjoon], The problem is when the yarn cluster is running on a machine where not all ports can be opened. The requirement is to restrict the ports used by a Spark job launched in yarn mode. I was able to set all other ports, such as _"spark.driver.port"_ and _"spark.blockManager.port"_, except the UI port. // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0") Can't the above code be modified to check whether the UI port is already set by the user, and fall back to a random port (as the comment describes) only when it is not? Do let me know your suggestion. Thanks!
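The guard the commenter asks for could look roughly like the sketch below. This is a hypothetical standalone rendering of the idea (the helper name `setUiPortIfUnset` is invented), not Spark's actual `ApplicationMaster` code:

```java
public class UiPortGuard {
    // Hedged sketch of the proposal: force the ephemeral web UI port only
    // when the user has not chosen one. This is NOT Spark's real code; the
    // class and method names here are assumptions for illustration.
    static void setUiPortIfUnset() {
        if (System.getProperty("spark.ui.port") == null) {
            // Ephemeral port 0: avoids conflicts with other Spark apps
            // running on the same box (the original SPARK-3627 behavior).
            System.setProperty("spark.ui.port", "0");
        }
    }

    public static void main(String[] args) {
        System.setProperty("spark.ui.port", "12345"); // simulate --conf spark.ui.port=12345
        setUiPortIfUnset();
        System.out.println(System.getProperty("spark.ui.port")); // prints "12345"
    }
}
```

With this shape, users who never set `spark.ui.port` keep the current conflict-free behavior, while users on locked-down machines can pin the port at their own risk.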
[jira] [Assigned] (SPARK-27259) Allow setting -1 as split size for InputFileBlock
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27259: - Assignee: Praneet Sharma > Allow setting -1 as split size for InputFileBlock > - > > Key: SPARK-27259 > URL: https://issues.apache.org/jira/browse/SPARK-27259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Simon poortman >Assignee: Praneet Sharma >Priority: Major > > > Since Spark 2.2.x, when a Spark job processes compressed HDFS files with a custom input file format, the job fails with the error "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative". A custom input file format returns a byte length of -1 for compressed file formats because compressed HDFS files are non-splittable, so a compressed input split is reported with offset 0 and length -1; Spark should treat a length of -1 as a valid split for compressed file formats. > > Earlier versions of Spark did not have this validation; it was introduced in Spark 2.2.x in the class InputFileBlockHolder, so Spark 2.2.x and later should again accept a byte length of -1 as a valid length for input splits. 
> > +Below is the stack trace.+ > Caused by: java.lang.IllegalArgumentException: requirement failed: length > (-1) cannot be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70) > at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > +Below is the code snippet which caused this issue.+ > require(length >= 0, s"length ($length) cannot be negative") // This validation caused the issue. > > {code:java} > // code placeholder > org.apache.spark.rdd.InputFileBlockHolder - spark-core > > def set(filePath: String, startOffset: Long, length: Long): Unit = { > require(filePath != null, "filePath cannot be null") > require(startOffset >= 0, s"startOffset ($startOffset) cannot be > negative") > require(length >= 0, s"length ($length) cannot be negative") > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), > startOffset, length)) > } > {code} > > +Steps to reproduce the issue.+ > Please refer to the below code to reproduce the issue. 
> {code:java} > // code placeholder > import org.apache.hadoop.mapred.JobConf > val hadoopConf = new JobConf() > import org.apache.hadoop.mapred.FileInputFormat > import org.apache.hadoop.fs.Path > FileInputFormat.setInputPaths(hadoopConf, new > Path("/output656/part-r-0.gz")) > val records = > sc.hadoopRDD(hadoopConf,classOf[com.platform.custom.storagehandler.INFAInputFormat], > classOf[org.apache.hadoop.io.LongWritable], > classOf[org.apache.hadoop.io.Writable]) > records.count() > {code} >
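One way to read the requested change is to treat -1 as a sentinel for "unknown length" while still rejecting other negative values. The sketch below is an assumed standalone rendering of that relaxed validation (the class and method names are invented), not the actual Spark patch:

```java
public class BlockLengthCheck {
    // Hypothetical relaxed validation: -1 means "unknown length" for
    // non-splittable (e.g. compressed) inputs and is accepted; anything
    // below -1 is still rejected, as are null paths and negative offsets.
    static void validate(String filePath, long startOffset, long length) {
        if (filePath == null) {
            throw new IllegalArgumentException("filePath cannot be null");
        }
        if (startOffset < 0) {
            throw new IllegalArgumentException(
                "startOffset (" + startOffset + ") cannot be negative");
        }
        if (length < -1) {
            throw new IllegalArgumentException(
                "length (" + length + ") must be >= -1");
        }
    }

    public static void main(String[] args) {
        // A compressed, non-splittable file reported as offset 0, length -1
        // now passes validation instead of throwing.
        validate("/output656/part-r-0.gz", 0, -1);
        System.out.println("length -1 accepted");
    }
}
```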
[jira] [Updated] (SPARK-27259) Allow setting -1 as split size for InputFileBlock
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27259: -- Summary: Allow setting -1 as split size for InputFileBlock (was: Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2.X)
[jira] [Resolved] (SPARK-29469) Avoid retries by RetryingBlockFetcher when ExternalBlockStoreClient is closed
[ https://issues.apache.org/jira/browse/SPARK-29469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29469. - Fix Version/s: 3.0.0 Resolution: Fixed > Avoid retries by RetryingBlockFetcher when ExternalBlockStoreClient is closed > - > > Key: SPARK-29469 > URL: https://issues.apache.org/jira/browse/SPARK-29469 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.0.0 > > > The following NPE was found in a job log: > 2019-10-14 20:06:16 ERROR RetryingBlockFetcher:143 - Exception while > beginning fetch of 2 outstanding blocks (after 3 retries) > java.lang.NullPointerException > at > org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > It happened after the BlockManager and ExternalBlockStoreClient were closed due to previous errors. In these cases, RetryingBlockFetcher does not need to retry. The NPE is harmless for job execution, but it is misleading when reading the log, especially for end users.
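The shape of the fix can be sketched as a closed-flag check consulted before each retry. Everything below (class and method names) is an assumption for illustration, not Spark's `RetryingBlockFetcher` API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RetryGuard {
    // Hedged sketch of the idea behind the fix: once the client is closed,
    // skip further retries instead of dereferencing a torn-down transport
    // and logging a misleading NullPointerException.
    private final AtomicBoolean closed = new AtomicBoolean(false);
    int attempts = 0;

    // Returns false when the fetch is skipped because the client is closed.
    boolean tryFetch() {
        if (closed.get()) {
            return false; // client already shut down: no point retrying
        }
        attempts++;       // a real implementation would initiate the fetch here
        return true;
    }

    void close() {
        closed.set(true);
    }
}
```

The flag is an `AtomicBoolean` because close() and the retry executor run on different threads; a plain `boolean` would not guarantee visibility of the shutdown.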
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952509#comment-16952509 ] Rakesh Raushan commented on SPARK-29477: I will raise a PR for this soon. > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Active Batches and Completed Batches tables can be revisited to put proper tooltips on the Batch Time and Record columns
[jira] [Issue Comment Deleted] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-29477: Comment: was deleted (was: i will check this issue)
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952507#comment-16952507 ] Dongjoon Hyun edited comment on SPARK-29465 at 10/16/19 4:50 AM: - Thank you for filing a JIRA, but I'll close this as `Not A Problem`, [~vishwasn]. In a YARN cluster environment, it's not a good idea to specify the port; it's fundamentally an anti-pattern.
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952505#comment-16952505 ] Dongjoon Hyun edited comment on SPARK-29465 at 10/16/19 4:48 AM: - Sorry, but this is trying to revert [SPARK-3627|https://github.com/apache/spark/commit/70e824f750aa8ed446eec104ba158b0503ba58a9]. Apache Spark is designed to ignore it intentionally. {code} // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0") {code} cc [~tgraves]
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even tried using a *spark-submit command with --conf spark.ui.port* does > spawn UI in required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs::_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' 'Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master override this and set a JVM property before > launch resulting in random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the _*run()*_ method in the above files sets the system properties > _*UI_PORT*_ and _*spark.ui.port*_, respectively. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29465. --- Resolution: Not A Problem Thank you for filing a JIRA, but I'll close this as `Not A Problem`, [~vishwasn].
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952505#comment-16952505 ] Dongjoon Hyun commented on SPARK-29465: --- Sorry, but this is trying to revert [SPARK-3627|https://github.com/apache/spark/commit/70e824f750aa8ed446eec104ba158b0503ba58a9]. Apache Spark is designed to ignore it intentionally. cc [~tgraves]
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29465: -- Affects Version/s: (was: 1.6.2)
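To make the precedence discussed in this thread concrete, here is a toy, pure-Python model (illustrative names only; this is not Spark's actual configuration-resolution code): the YARN ApplicationMaster writes `spark.ui.port=0` (an ephemeral port) into the JVM system properties after the user's configuration is read, and for this setting the system property wins, so the user's value never reaches the UI server.

```python
user_conf = {"spark.ui.port": "12345"}   # what the user asked for
system_props = {}                        # JVM system properties

def application_master_prepare(props):
    # mirrors System.setProperty("spark.ui.port", "0") in ApplicationMaster:
    # "0" means "pick an ephemeral port" so concurrent apps don't collide
    props["spark.ui.port"] = "0"

def effective_ui_port(conf, props):
    # the system property takes precedence over the SparkConf entry
    return props.get("spark.ui.port", conf.get("spark.ui.port", "4040"))

application_master_prepare(system_props)
assert effective_ui_port(user_conf, system_props) == "0"  # user value ignored
```

This is why setting `spark.ui.port` in SparkConf or via `-Dspark.ui.port` has no effect in yarn-cluster mode: the override happens later in the launch sequence.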
[jira] [Commented] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952495#comment-16952495 ] Dongjoon Hyun commented on SPARK-29423: --- Thank you all. This is merged to master as an Improvement. > leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus > --- > > Key: SPARK-29423 > URL: https://issues.apache.org/jira/browse/SPARK-29423 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1, 2.4.3, 2.4.4 >Reporter: pin_zhang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > 1. Start the server with start-thriftserver.sh > 2. A JDBC client connects and disconnects to hiveserver2 > for (int i = 0; i < 1; i++) { >Connection conn = > DriverManager.getConnection("jdbc:hive2://localhost:1", "test", ""); >conn.close(); > } > 3. Instances of > org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep > increasing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29423: -- Affects Version/s: 2.4.4
[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29423: -- Issue Type: Improvement (was: Bug)
[jira] [Resolved] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29423. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26089 [https://github.com/apache/spark/pull/26089]
[jira] [Assigned] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29423: - Assignee: Yuming Wang
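The repro above boils down to a classic listener-accumulation leak: each short-lived connection registers a listener bus on a shared, long-lived bus and never removes it. A minimal pure-Python model of the pattern (illustrative classes; this is not Spark's actual `StreamingQueryListenerBus` code):

```python
class SharedBus:
    """Long-lived bus shared across all sessions (like the global LiveListenerBus)."""
    def __init__(self):
        self.listeners = []
    def add(self, listener):
        self.listeners.append(listener)
    def remove(self, listener):
        self.listeners.remove(listener)

class Session:
    """Short-lived session that registers a listener on the shared bus."""
    def __init__(self, bus):
        self.bus = bus
        self.listener = object()
        bus.add(self.listener)          # registered on open
    def close_leaky(self):
        pass                            # bug: listener never unregistered
    def close_fixed(self):
        self.bus.remove(self.listener)  # fix: unregister on close

bus = SharedBus()
for _ in range(100):
    Session(bus).close_leaky()
assert len(bus.listeners) == 100        # leak: grows with every connect/disconnect

bus2 = SharedBus()
for _ in range(100):
    Session(bus2).close_fixed()
assert len(bus2.listeners) == 0         # fixed: nothing accumulates
```

In the real bug, the JDBC connect/disconnect loop against the Thrift server plays the role of the session loop here.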
[jira] [Updated] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29477: - Description: The Active Batches and Completed Batches tables can be reviewed to add proper tooltips for the Batch Time and Record columns Summary: Improve tooltip information for Streaming Tab (was: Improve tooltip information for Streaming Statistics under Streaming Tab) > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Active Batches and Completed Batches tables can be reviewed to add proper tooltips > for the Batch Time and Record columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952461#comment-16952461 ] Kangtian commented on SPARK-29267: -- [~hyukjin.kwon] The screenshots were not committed; they show my solution for `countApprox`. "Also, do you mean {{timeout}} does not affect the finish time of {{countApprox(timeout: Long, confidence: Double = 0.95)}}?" Yes, when the timeout is reached, the job does not stop; it keeps running in the background. > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kangtian >Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png, > image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png > > > {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues to run until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, > without FinalValue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
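The two behaviours being contrasted can be sketched deterministically without Spark. This is a toy model (`run_job` and its parameters are illustrative, not Spark APIs): today's `countApprox` hands back the partial value when the timeout fires but lets the job run to completion, while the request is to also cancel the job at that point.

```python
def run_job(num_partitions, timeout_after, cancel_on_timeout):
    """Process partitions one at a time; the 'timeout' fires between partitions."""
    processed = 0
    partial_at_timeout = None
    for p in range(num_partitions):
        if p == timeout_after and partial_at_timeout is None:
            partial_at_timeout = processed      # value the caller receives
            if cancel_on_timeout:
                break                           # proposed: stop the job here
        processed += 1                          # otherwise work keeps running
    return partial_at_timeout, processed

# Current behaviour per the comment: partial value returned at timeout,
# but all 10 partitions still get processed in the background.
assert run_job(10, 3, cancel_on_timeout=False) == (3, 10)

# Requested behaviour: same partial value, but the job stops immediately.
assert run_job(10, 3, cancel_on_timeout=True) == (3, 3)
```

The partial value is identical in both cases; the difference is only whether cluster resources keep being consumed after the timeout.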
[jira] [Commented] (SPARK-29485) Improve BooleanSimplification performance
[ https://issues.apache.org/jira/browse/SPARK-29485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952459#comment-16952459 ] Yuming Wang commented on SPARK-29485: - cc [~manifoldQAQ] > Improve BooleanSimplification performance > - > > Key: SPARK-29485 > URL: https://issues.apache.org/jira/browse/SPARK-29485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:scala} > val columns = (0 until 1000).map{ i => s"id as id$i"} > spark.range(1).selectExpr(columns : _*).write.saveAsTable("t1") > spark.range(1).selectExpr(columns : _*).write.saveAsTable("t2") > RuleExecutor.resetMetrics() > spark.table("t1").join(spark.table("t2"), (1 until 800).map(i => > s"id${i}")).show(false) > logWarning(RuleExecutor.dumpTimeSpent()) > {code} > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs: 20157 > Total time: 12.918977054 seconds > Rule > Effective Time / Total Time > Effective Runs / Total Runs > org.apache.spark.sql.catalyst.optimizer.BooleanSimplification > 0 / 9835799647 0 / 3 > > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 1532613008 / 1532613008 1 / 1 > > ... > {noformat} > If we disable {{BooleanSimplification}}: > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs: 20154 > Total time: 3.715814437 seconds > Rule > Effective Time / Total Time > Effective Runs / Total Runs > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 2081338100 / 2081338100 1 / 1 > > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29485) Improve BooleanSimplification performance
Yuming Wang created SPARK-29485: --- Summary: Improve BooleanSimplification performance Key: SPARK-29485 URL: https://issues.apache.org/jira/browse/SPARK-29485 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
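The roughly 9.8 seconds spent in `BooleanSimplification` on an 800-key join condition is consistent with a rule whose cost grows super-linearly in the number of conjuncts. As a hedged, pure-Python illustration of that complexity class (this is not Spark's actual optimizer code), compare deduplicating `a && a -> a` with pairwise scans versus a hash set:

```python
def simplify_quadratic(terms):
    # Pairwise-scan dedup: each conjunct is compared against every conjunct
    # kept so far, so total cost grows as O(n^2) in the conjunct count.
    out = []
    for t in terms:
        if all(t != u for u in out):
            out.append(t)
    return out

def simplify_hashed(terms):
    # Same result, but set-based membership tests make it O(n) overall.
    seen, out = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# 8000 conjuncts over 800 distinct predicates, echoing the 800-column join
terms = [f"id{i % 800}" for i in range(8000)]
assert simplify_quadratic(terms) == simplify_hashed(terms)
```

The exact hotspot inside Spark's rule may differ; the point is that on wide conjunctions, any per-term rescan of the whole expression dominates total optimizer time, matching the metrics above.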
[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
[ https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952399#comment-16952399 ] Huaxin Gao commented on SPARK-29381: [~podongfeng] Sorry, somehow I thought this was a follow-up Jira to the previous one and that I only needed to add _ to the existing XXXParams classes. I will open a follow-up Jira to add _LinearSVCParams, _LinearRegressionParams and a few others. Before I work on the last parity Jira, I will double-check all the ML packages to make sure Scala and Python are in sync. > Add 'private' _XXXParams classes for classification & regression > > > Key: SPARK-29381 > URL: https://issues.apache.org/jira/browse/SPARK-29381 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:58 PM: -- this is fixed by [https://github.com/apache/spark/pull/26118] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): this is fixed by [https://github.com/apache/spark/pull/26118|https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
> Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > 
"/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. However, this is not true: a Row orders its elements > alphabetically, but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to go (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem: the > fields are out of alphabetical order and one field is a StructType, making > schema._needSerializeAnyField==True . However, we encountered this in real use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:58 PM: -- this is fixed by [https://github.com/apache/spark/pull/26118|https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): this is fixed by [https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
> Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > 
"/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. However, this is not true: a Row orders its elements > alphabetically, but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem: the > fields are out of alphabetical order and one field is a StructType, making > schema._needSerializeAnyField == True. However, we encountered this in real use. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
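The misalignment described above can be illustrated without Spark at all. The sketch below is plain Python with hypothetical names: it mimics how pre-fix PySpark Rows sorted keyword arguments alphabetically while a StructType keeps fields in declaration order, so the positional zip pairs the wrong value with each field whenever the schema is not alphabetically ordered.

```python
# Mimic pre-fix PySpark behaviour: Row sorts its keyword arguments
# alphabetically, while the schema keeps declaration order.
def make_row(**kwargs):
    # Return values sorted by field name, like pyspark.sql.Row did.
    return tuple(v for _, v in sorted(kwargs.items()))

schema_fields = ["field2", "field1"]  # declaration order, not alphabetical
row = make_row(field1="Hello", field2=("world",))

# zip(schema.fields, obj, ...) pairs each schema field with a value
# positionally -- here 'field2' gets 'Hello', which is the field1 value.
pairs = dict(zip(schema_fields, row))
print(pairs)  # {'field2': 'Hello', 'field1': ('world',)}
```

Reordering `schema_fields` to `["field1", "field2"]` makes the pairing line up again, which matches the reporter's observation that reordering the schema avoids the error.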
[jira] [Updated] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-24540: - Priority: Minor (was: Major) > Support for multiple character delimiter in Spark CSV read > -- > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Ashwin K >Assignee: Jeff Evans >Priority: Minor > Fix For: 3.0.0 > > > Currently, the delimiter option used by Spark (since 2.0) to read and split CSV files/data > only supports a single-character delimiter. If we try to provide multiple > delimiters, we observe the following error message. > e.g.: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple-character delimiters, and presently > we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There is a work-around (reading the data as text and using the split > option), but in my opinion this defeats the purpose, advantage and efficiency > of a direct read from a CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
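The text-then-split work-around mentioned in the ticket can be sketched in a few lines of plain Python (the column values and the two-character delimiter ", " are illustrative); in Spark the analogous approach would be reading the file as text and splitting each line on the multi-character delimiter:

```python
import io

# A small CSV-like payload that uses the two-character delimiter ", "
raw = io.StringIO("1, alice, 10\n2, bob, 20\n")

# Read as plain text, then split each line on the full delimiter string.
rows = [line.rstrip("\n").split(", ") for line in raw]
print(rows)  # [['1', 'alice', '10'], ['2', 'bob', '20']]
```

This sidesteps the single-character restriction at the cost of losing CSV features such as quoting and schema inference, which is exactly the inefficiency the reporter objects to.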
[jira] [Assigned] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-24540: Assignee: Jeff Evans -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-24540. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26027 [https://github.com/apache/spark/pull/26027] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:25 PM: -- This is fixed by [https://github.com/apache/spark/pull/26118]. As in the RDD days, Row is considered a tuple, while all Rows coming from DataFrames have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): This is fixed by [https://github.com/apache/spark/pull/26118]. It's strange that Row is considered a tuple (it also causes the tests to look a bit strange). However, changing the hierarchy seemed a bit too adventurous. > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) -
[jira] [Assigned] (SPARK-28947) Status logging occurs on every state change but not at an interval for liveness.
[ https://issues.apache.org/jira/browse/SPARK-28947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-28947: -- Assignee: Kent Yao > Status logging occurs on every state change but not at an interval for > liveness. > > > Key: SPARK-28947 > URL: https://issues.apache.org/jira/browse/SPARK-28947 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > The start method of `LoggingPodStatusWatcherImpl` should be invoked -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28947) Status logging occurs on every state change but not at an interval for liveness.
[ https://issues.apache.org/jira/browse/SPARK-28947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-28947. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25648 [https://github.com/apache/spark/pull/25648] > Status logging occurs on every state change but not at an interval for > liveness. > > > Key: SPARK-28947 > URL: https://issues.apache.org/jira/browse/SPARK-28947 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > The start method of `LoggingPodStatusWatcherImpl` should be invoked -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
[ https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952224#comment-16952224 ] Josh Rosen commented on SPARK-27558: I just ran into this as well: it looks like the problem is that {{UnsafeInMemorySorter.getMemoryUsage}} does not gracefully handle the case where {{array = null}}. I suspect that adding {code:java} if (array != null) { current code } else { 0L } {code} would be a correct, sufficient fix. > NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter > causing tasks to hang > > > Key: SPARK-27558 > URL: https://issues.apache.org/jira/browse/SPARK-27558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2 >Reporter: Alessandro Bellina >Priority: Major > > We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the > array we are accessing there being null). This looks to be caused by a Spark > OOM when UnsafeInMemorySorter is trying to spill. > This is likely a symptom of > https://issues.apache.org/jira/browse/SPARK-21492. The real question for this > ticket is: could we handle things more gracefully, rather than throwing an NPE? For > example: > Remove this: > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 > so that when this fails (storing the new array into a temporary first): > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 > we don't end up with a null "array". This state is causing one of our jobs to > hang infinitely (we think) due to the original allocation error. 
> Stack trace for reference > {noformat} > 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR > org.apache.spark.TaskContextImpl - Error in TaskCompletionListener > java.lang.NullPointerException > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) > at org.apache.spark.scheduler.Task.run(Task.scala:119) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR > org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 > (TID 46729) > org.apache.spark.util.TaskCompletionListenerException: null > Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 > org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) > > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) > > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) > > org.apache.spark.memory.TaskMemoryManager.allocatePage(Tas
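The guard suggested in the comment above amounts to treating a freed or never-allocated buffer as contributing zero bytes of memory usage. A minimal Python sketch of that defensive pattern (names and the 8-byte word size are illustrative, not the actual Spark code):

```python
WORD_SIZE = 8  # bytes per sorter slot, illustrative

def get_memory_usage(array):
    # After a failed spill the backing array may have been freed (None);
    # report 0 instead of raising, mirroring the proposed null guard.
    if array is None:
        return 0
    return len(array) * WORD_SIZE

print(get_memory_usage(None))        # 0
print(get_memory_usage([0] * 1024))  # 8192
```

The point of the guard is that accounting code called from a completion listener should never fail, since a secondary exception there masks the original OOM and can leave the task hung.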
[jira] [Resolved] (SPARK-29470) Update plugins to latest versions
[ https://issues.apache.org/jira/browse/SPARK-29470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29470. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26117 [https://github.com/apache/spark/pull/26117] > Update plugins to latest versions > - > > Key: SPARK-29470 > URL: https://issues.apache.org/jira/browse/SPARK-29470 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29470) Update plugins to latest versions
[ https://issues.apache.org/jira/browse/SPARK-29470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29470: - Assignee: Dongjoon Hyun > Update plugins to latest versions > - > > Key: SPARK-29470 > URL: https://issues.apache.org/jira/browse/SPARK-29470 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29484) Very Poor Performance When Reading SQL Server Tables Using Active Directory Authentication
Scott Black created SPARK-29484: --- Summary: Very Poor Performance When Reading SQL Server Tables Using Active Directory Authentication Key: SPARK-29484 URL: https://issues.apache.org/jira/browse/SPARK-29484 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Environment: Test Case jars for ms sql client adal4j 1.6.4 oauth2-oidc-sdk 6.16.2 json-smart 2.3 nimbus-jose-jwt 7.9 mssql-jdbc 7.4.0.jre8 Slow JDBC URL using AD "jdbc:sqlserver://:1433;database=;user=;password=;encrypt=true;ServerCertificate=false;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;authentication=ActiveDirectoryPassword" URL For Expected Performance Using SQL Server Account "jdbc:sqlserver://:1433;database=;user=;password=;encrypt=true;ServerCertificate=false;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;authentication=sqlPassword" Reporter: Scott Black When creating a dataframe via JDBC from MS SQL Server, performance is so bad as to be unusable when authentication is performed with Active Directory. When authentication uses a SQL Server account, performance is as expected. I created a Java app that connected to the same Azure SQL database via an Active Directory account, and performance was the same as with the SQL Server account. When connecting via Active Directory it can take 30 minutes to 1 hour to read a 250-row table, compared to 5 seconds using a SQL Server account or a console Java app connecting via AD. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29387) Support `*` and `/` operators for intervals
[ https://issues.apache.org/jira/browse/SPARK-29387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29387: --- Summary: Support `*` and `/` operators for intervals (was: Support `*` and `\` operators for intervals) > Support `*` and `/` operators for intervals > --- > > Key: SPARK-29387 > URL: https://issues.apache.org/jira/browse/SPARK-29387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Support `*` by numeric, `/` by numeric. See > [https://www.postgresql.org/docs/12/functions-datetime.html] > ||Operator||Example||Result|| > |*|900 * interval '1 second'|interval '00:15:00'| > |*|21 * interval '1 day'|interval '21 days'| > |/|interval '1 hour' / double precision '1.5'|interval '00:40:00'| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
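The semantics in the table above match ordinary duration arithmetic; Python's `datetime.timedelta` reproduces all three rows. This is an analogy for the intended behaviour, not Spark's implementation:

```python
from datetime import timedelta

# 900 * interval '1 second' = interval '00:15:00'
assert 900 * timedelta(seconds=1) == timedelta(minutes=15)

# 21 * interval '1 day' = interval '21 days'
assert 21 * timedelta(days=1) == timedelta(days=21)

# interval '1 hour' / 1.5 = interval '00:40:00'
assert timedelta(hours=1) / 1.5 == timedelta(minutes=40)
```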
[jira] [Created] (SPARK-29483) Bump Jackson to 2.10.0
Fokko Driesprong created SPARK-29483: Summary: Bump Jackson to 2.10.0 Key: SPARK-29483 URL: https://issues.apache.org/jira/browse/SPARK-29483 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: Fokko Driesprong Fix For: 3.0.0 Fixes the following CVEs: https://www.cvedetails.com/cve/CVE-2019-16942/ https://www.cvedetails.com/cve/CVE-2019-16943/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27259) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952164#comment-16952164 ] Dongjoon Hyun commented on SPARK-27259: --- Hi, [~praneetsharma]. Thank you for making a JIRA, but we cannot reproduce your problem due to the following. {code} com.platform.custom.storagehandler.INFAInputFormat {code} > Processing Compressed HDFS files with spark failing with error: > "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot > be negative" from spark 2.2.X > - > > Key: SPARK-27259 > URL: https://issues.apache.org/jira/browse/SPARK-27259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Simon poortman >Priority: Major > > > From Spark 2.2.x onward, when a Spark job processes compressed HDFS files > with a custom input file format, the job fails with the error > "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot > be negative". The custom input file format returns -1 as the number of bytes > for compressed file formats because compressed HDFS files are non-splittable, > so for compressed input formats the split has offset 0 and length -1; Spark > should accept the length value -1 as a valid split for compressed file formats. > > We observed that earlier versions of Spark don't have this validation, and > found that this validation was introduced in the class InputFileBlockHolder > in Spark 2.2.x, so Spark should accept the length value -1 as valid for > input splits from Spark 2.2.x onward as well. 
> > +Below is the stack trace.+ > Caused by: java.lang.IllegalArgumentException: requirement failed: length > (-1) cannot be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:226) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > +Below is the code snippet that caused this issue.+ > ** {color:#ff}require(length >= 0, s"length ($length) cannot be > negative"){color} // This validation caused the issue. > > {code:java} > // code placeholder > org.apache.spark.rdd.InputFileBlockHolder - spark-core > > def set(filePath: String, startOffset: Long, length: Long): Unit = { > require(filePath != null, "filePath cannot be null") > require(startOffset >= 0, s"startOffset ($startOffset) cannot be > negative") > require(length >= 0, s"length ($length) cannot be negative") > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), > startOffset, length)) > } > {code} > > +Steps to reproduce the issue.+ > Please refer to the code below to reproduce the issue. 
> {code:java} > // code placeholder > import org.apache.hadoop.mapred.JobConf > val hadoopConf = new JobConf() > import org.apache.hadoop.mapred.FileInputFormat > import org.apache.hadoop.fs.Path > FileInputFormat.setInputPaths(hadoopConf, new > Path("/output656/part-r-0.gz")) > val records = > sc.hadoopRDD(hadoopConf,classOf[com.platform.custom.storagehandler.INFAInputFormat], > classOf[org.apache.hadoop.io.LongWritable], > classOf[org.apache.hadoop.io.Writable]) > records.count() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
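The relaxation the reporter asks for, accepting -1 as an "unknown length" sentinel for non-splittable compressed inputs, can be sketched as follows. This is plain Python with hypothetical names; the real validation is the Scala `require` shown in the description:

```python
def set_input_block(file_path, start_offset, length):
    # -1 is the sentinel some input formats report for non-splittable
    # (e.g. gzip-compressed) files whose split length is unknown.
    if file_path is None:
        raise ValueError("filePath cannot be null")
    if start_offset < 0:
        raise ValueError("startOffset (%d) cannot be negative" % start_offset)
    if length < -1:  # relaxed: allow -1, still reject other negatives
        raise ValueError("length (%d) is invalid" % length)
    return (file_path, start_offset, length)

# A compressed, non-splittable file: one split at offset 0 with length -1
print(set_input_block("/output656/part-r-0.gz", 0, -1))
```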
[jira] [Resolved] (SPARK-29182) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29182. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25856 [https://github.com/apache/spark/pull/25856] > Cache preferred locations of checkpointed RDD > - > > Key: SPARK-29182 > URL: https://issues.apache.org/jira/browse/SPARK-29182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > One Spark job in our cluster fits many ALS models in parallel. The fitting > goes well, but next, when we union all factors, the union operation is very > slow. > Looking at the driver stack dump, it looks like the driver spends a lot of > time on computing preferred locations. As we checkpoint training data before > fitting ALS, the time is spent in > ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS > interface to query file status and block locations. As we have a big number of > partitions derived from the checkpointed RDD, the union spends a lot of > time querying the same information. > This proposes to add a Spark config to control the caching behavior of > ReliableCheckpointRDD.getPreferredLocations. If it is enabled, > getPreferredLocations will only compute preferred locations once and cache them > for later usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
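The caching described above is, in essence, memoization: perform the expensive DFS file-status/block-location lookup once per partition and reuse the result on every later call. An illustrative Python sketch (the names and host mapping are invented, not Spark's actual code):

```python
from functools import lru_cache

CALLS = {"count": 0}  # track how often the expensive lookup runs

@lru_cache(maxsize=None)
def preferred_locations(partition_id):
    # Stand-in for the expensive DFS file-status / block-location query.
    CALLS["count"] += 1
    return ("host-%d" % (partition_id % 3),)

# Repeated union-time lookups for the same partition hit the cache
# after the first call, so the DFS is queried only once.
for _ in range(1000):
    preferred_locations(7)
print(CALLS["count"])  # 1
```

The trade-off, which is why the Spark change gates this behind a config, is that cached locations can go stale if blocks move, so callers must accept slightly outdated locality hints in exchange for the speedup.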
[jira] [Resolved] (SPARK-28885) Follow ANSI store assignment rules in table insertion by default
[ https://issues.apache.org/jira/browse/SPARK-28885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28885. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26107 [https://github.com/apache/spark/pull/26107] > Follow ANSI store assignment rules in table insertion by default > > > Key: SPARK-28885 > URL: https://issues.apache.org/jira/browse/SPARK-28885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > When inserting a value into a column with a different data type, Spark > performs type coercion. Currently, we support 3 policies for the store > assignment rules: ANSI, legacy and strict, which can be set via the option > "spark.sql.storeAssignmentPolicy": > 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the > behavior is mostly the same as PostgreSQL. It disallows certain unreasonable > type conversions such as converting `string` to `int` and `double` to > `boolean`. It will throw a runtime exception if the value is > out-of-range (overflow). > 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, > which is very loose. E.g., converting either `string` to `int` or `double` to > `boolean` is allowed. It is the current behavior in Spark 2.x for > compatibility with Hive. When inserting an out-of-range value into an integral > field, the low-order bits of the value are inserted (the same as Java/Scala > numeric type casting). For example, if 257 is inserted into a field of Byte > type, the result is 1. > 3. Strict: Spark doesn't allow any possible precision loss or data truncation > in store assignment, e.g., neither `double` to `int` nor `decimal` to > `double` is allowed. The rules were originally for the Dataset encoder. As far > as I know, no mainstream DBMS uses this policy by default. 
> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses > "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 > in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
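The legacy overflow behavior quoted above (inserting 257 into a Byte field yields 1) follows Java/Scala numeric casting, i.e. keeping the low-order bits of the value. A small illustrative Python helper, not Spark code:

```python
def insert_as_byte_legacy(value: int) -> int:
    """Legacy policy: keep the low-order 8 bits and reinterpret as a signed byte."""
    b = value & 0xFF                  # truncate to 8 bits
    return b - 256 if b > 127 else b  # reinterpret as signed

print(insert_as_byte_legacy(257))  # 1, the example from the description
print(insert_as_byte_legacy(200))  # -56: silent wrap-around under legacy

# Under the ANSI policy an out-of-range value raises a runtime exception
# instead, and the strict policy rejects the lossy int -> byte insert outright.
```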
[jira] [Assigned] (SPARK-28885) Follow ANSI store assignment rules in table insertion by default
[ https://issues.apache.org/jira/browse/SPARK-28885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28885: - Assignee: Gengliang Wang > Follow ANSI store assignment rules in table insertion by default > > > Key: SPARK-28885 > URL: https://issues.apache.org/jira/browse/SPARK-28885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > When inserting a value into a column with a different data type, Spark > performs type coercion. Currently, we support 3 policies for the store > assignment rules: ANSI, legacy and strict, which can be set via the option > "spark.sql.storeAssignmentPolicy": > 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the > behavior is mostly the same as PostgreSQL. It disallows certain unreasonable > type conversions such as converting `string` to `int` and `double` to > `boolean`. It will throw a runtime exception if the value is > out-of-range (overflow). > 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, > which is very loose. E.g., converting either `string` to `int` or `double` to > `boolean` is allowed. It is the current behavior in Spark 2.x for > compatibility with Hive. When inserting an out-of-range value into an integral > field, the low-order bits of the value are inserted (the same as Java/Scala > numeric type casting). For example, if 257 is inserted into a field of Byte > type, the result is 1. > 3. Strict: Spark doesn't allow any possible precision loss or data truncation > in store assignment, e.g., neither `double` to `int` nor `decimal` to > `double` is allowed. The rules were originally for the Dataset encoder. As far > as I know, no mainstream DBMS uses this policy by default. > Currently, the V1 data source uses "Legacy" policy by default, while V2 uses > "Strict". 
This proposal is to use "ANSI" policy by default for both V1 and V2 > in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28211) Shuffle Storage API: Driver Lifecycle
[ https://issues.apache.org/jira/browse/SPARK-28211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-28211: Assignee: Yifei Huang > Shuffle Storage API: Driver Lifecycle > - > > Key: SPARK-28211 > URL: https://issues.apache.org/jira/browse/SPARK-28211 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Yifei Huang >Priority: Major > > As part of the shuffle storage API, allow users to hook in application-wide > startup and shutdown methods. This can do things like create tables in the > shuffle storage database, or register / unregister against file servers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28211) Shuffle Storage API: Driver Lifecycle
[ https://issues.apache.org/jira/browse/SPARK-28211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-28211. -- Fix Version/s: 3.0.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/25823 > Shuffle Storage API: Driver Lifecycle > - > > Key: SPARK-28211 > URL: https://issues.apache.org/jira/browse/SPARK-28211 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Yifei Huang >Priority: Major > Fix For: 3.0.0 > > > As part of the shuffle storage API, allow users to hook in application-wide > startup and shutdown methods. This can do things like create tables in the > shuffle storage database, or register / unregister against file servers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29481: Description: The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // report table not found as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution was: The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // fails as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution > all the commands should look up catalog/table like v2 commands > -- > > Key: SPARK-29481 > URL: https://issues.apache.org/jira/browse/SPARK-29481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The newly added v2 commands support multi-catalog and respect the current > catalog/namespace. However, it's not true for old v1 commands. > This leads to very confusing behaviors, for example > {code} > USE my_catalog > DESC t // success and describe the table t from my_catalog > ANALYZE TABLE t // report table not found as there is no table t in the > session catalog > {code} > We should make sure all the commands have the same behavior regarding table > resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29481: Summary: all the commands should look up catalog/table like v2 commands (was: all the commands should respect the current catalog/namespace) > all the commands should look up catalog/table like v2 commands > -- > > Key: SPARK-29481 > URL: https://issues.apache.org/jira/browse/SPARK-29481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The newly added v2 commands support multi-catalog and respect the current > catalog/namespace. However, it's not true for old v1 commands. > This leads to very confusing behaviors, for example > {code} > USE my_catalog > DESC t // success and describe the table t from my_catalog > ANALYZE TABLE t // fails as there is no table t in the session catalog > {code} > We should make sure all the commands have the same behavior regarding table > resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29482) ANALYZE TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29482: Summary: ANALYZE TABLE should look up catalog/table like v2 commands (was: ANALYZE TABLE should respect current catalog/namespace) > ANALYZE TABLE should look up catalog/table like v2 commands > > > Key: SPARK-29482 > URL: https://issues.apache.org/jira/browse/SPARK-29482 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29482) ANALYZE TABLE should respect current catalog/namespace
Wenchen Fan created SPARK-29482: --- Summary: ANALYZE TABLE should respect current catalog/namespace Key: SPARK-29482 URL: https://issues.apache.org/jira/browse/SPARK-29482 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29481) all the commands should respect the current catalog/namespace
Wenchen Fan created SPARK-29481: --- Summary: all the commands should respect the current catalog/namespace Key: SPARK-29481 URL: https://issues.apache.org/jira/browse/SPARK-29481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // fails as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29414) HasOutputCol param isSet() property is not preserved after persistence
[ https://issues.apache.org/jira/browse/SPARK-29414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952075#comment-16952075 ] Huaxin Gao commented on SPARK-29414: I somehow can't reproduce the problem. In my test, isSet returns false for loaded_model. Could you please try 2.4? > HasOutputCol param isSet() property is not preserved after persistence > -- > > Key: SPARK-29414 > URL: https://issues.apache.org/jira/browse/SPARK-29414 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > HasOutputCol param isSet() property is not preserved after saving and loading > using DefaultParamsReadable and DefaultParamsWritable.
> {code:python}
> import pytest
> from pyspark import keyword_only
> from pyspark.ml import Model
> from pyspark.sql import DataFrame
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param.shared import HasInputCol, HasOutputCol
> from pyspark.sql.functions import *
>
>
> class HasOutputColTester(Model,
>                          HasInputCol,
>                          HasOutputCol,
>                          DefaultParamsReadable,
>                          DefaultParamsWritable):
>     @keyword_only
>     def __init__(self, inputCol: str = None, outputCol: str = None):
>         super(HasOutputColTester, self).__init__()
>         kwargs = self._input_kwargs
>         self.setParams(**kwargs)
>
>     @keyword_only
>     def setParams(self, inputCol: str = None, outputCol: str = None):
>         kwargs = self._input_kwargs
>         self._set(**kwargs)
>         return self
>
>     def _transform(self, data: DataFrame) -> DataFrame:
>         return data
>
>
> class TestHasInputColParam(object):
>     def test_persist_input_col_set(self, spark, temp_dir):
>         path = temp_dir + '/test_model'
>         model = HasOutputColTester()
>         assert not model.isDefined(model.inputCol)
>         assert not model.isSet(model.inputCol)
>         assert model.isDefined(model.outputCol)
>         assert not model.isSet(model.outputCol)
>         model.write().overwrite().save(path)
>         loaded_model: HasOutputColTester = HasOutputColTester.load(path)
>         assert not loaded_model.isDefined(model.inputCol)
>         assert not loaded_model.isSet(model.inputCol)
>         assert loaded_model.isDefined(model.outputCol)
>         assert not loaded_model.isSet(model.outputCol)  # AssertionError: assert not True
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29480) Ensure that each input row refers to a single relationship type
Martin Junghanns created SPARK-29480: Summary: Ensure that each input row refers to a single relationship type Key: SPARK-29480 URL: https://issues.apache.org/jira/browse/SPARK-29480 Project: Spark Issue Type: New Feature Components: Graph Affects Versions: 3.0.0 Reporter: Martin Junghanns As pointed out in https://github.com/apache/spark/pull/24851/files#r334704485 we need to make sure that no row in an input Dataset representing different relationship types has multiple rel type columns set to {{true}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29479) Design more fluent API for Cypher querying
Martin Junghanns created SPARK-29479: Summary: Design more fluent API for Cypher querying Key: SPARK-29479 URL: https://issues.apache.org/jira/browse/SPARK-29479 Project: Spark Issue Type: New Feature Components: Graph Affects Versions: 3.0.0 Reporter: Martin Junghanns As pointed out in https://github.com/apache/spark/pull/24851/files#r334614538 we would like to add a more fluent API for executing and parameterizing Cypher queries. This issue is supposed to track that progress. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28560. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25295 [https://github.com/apache/spark/pull/25295] > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28560: --- Assignee: Ke Jia > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
[ https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951909#comment-16951909 ] Sean R. Owen commented on SPARK-27981: -- I'm not sure there is a workaround possible for this one anytime soon. It's just a warning. I don't think we'd be able to resolve it. > Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` > -- > > Key: SPARK-27981 > URL: https://issues.apache.org/jira/browse/SPARK-27981 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This PR aims to remove the following warnings for `java.nio.Bits.unaligned` > at JDK9/10/11/12. Please note that there are more warnings which is beyond of > this PR's scope. > {code} > bin/spark-shell --driver-java-options=--illegal-access=warn > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) > to method java.nio.Bits.unaligned() > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29473) move statement logical plans to a new file
[ https://issues.apache.org/jira/browse/SPARK-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29473. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26125 [https://github.com/apache/spark/pull/26125] > move statement logical plans to a new file > -- > > Key: SPARK-29473 > URL: https://issues.apache.org/jira/browse/SPARK-29473 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29478) Improve tooltip information for AggregatedLogs Tab
[ https://issues.apache.org/jira/browse/SPARK-29478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951862#comment-16951862 ] Ankit Raj Boudh commented on SPARK-29478: - I will check this issue > Improve tooltip information for AggregatedLogs Tab > -- > > Key: SPARK-29478 > URL: https://issues.apache.org/jira/browse/SPARK-29478 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29478) Improve tooltip information for AggregatedLogs Tab
ABHISHEK KUMAR GUPTA created SPARK-29478: Summary: Improve tooltip information for AggregatedLogs Tab Key: SPARK-29478 URL: https://issues.apache.org/jira/browse/SPARK-29478 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Statistics under Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951858#comment-16951858 ] Ankit Raj Boudh commented on SPARK-29477: - i will check this issue > Improve tooltip information for Streaming Statistics under Streaming Tab > > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29477) Improve tooltip information for Streaming Statistics under Streaming Tab
ABHISHEK KUMAR GUPTA created SPARK-29477: Summary: Improve tooltip information for Streaming Statistics under Streaming Tab Key: SPARK-29477 URL: https://issues.apache.org/jira/browse/SPARK-29477 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mob-ai updated SPARK-29224: --- Attachment: url_loss.xlsx > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Priority: Major > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mob-ai updated SPARK-29224: --- Attachment: (was: url_loss.xlsx) > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Priority: Major > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
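The factorization machine proposed in SPARK-29224 scores y(x) = w0 + sum_i w_i*x_i + sum_{i<j} <v_i, v_j>*x_i*x_j, and the cited Rendle paper shows the pairwise term can be computed in O(k*n) rather than O(n^2). A pure-Python sketch of the prediction step (variable names and numbers are illustrative, not the eventual ml-pipeline API):

```python
# FM prediction using Rendle's O(k*n) rewrite of the pairwise term:
#   sum_{i<j} <v_i, v_j> x_i x_j
#     = 0.5 * sum_f [ (sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2 ]
def fm_predict(w0, w, V, x):
    """w0: bias; w: linear weights (n); V: factor matrix (n x k); x: features (n)."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += s * s - s_sq
    return w0 + linear + 0.5 * pairwise

# Sanity check against the naive O(n^2) pairwise sum on toy values.
w0, w = 0.1, [0.2, -0.3, 0.5]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
x = [1.0, 2.0, 0.5]
naive = w0 + sum(w[i] * x[i] for i in range(3)) + sum(
    sum(V[i][f] * V[j][f] for f in range(2)) * x[i] * x[j]
    for i in range(3) for j in range(i + 1, 3))
assert abs(fm_predict(w0, w, V, x) - naive) < 1e-12
```

Training (logloss/MSE with mini-batch SGD, as the requirements list) would fit w0, w, and V; only the forward pass is sketched here.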
[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
[ https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951839#comment-16951839 ] pavithra ramachandran commented on SPARK-29476: --- i shall work on this > Add tooltip information for Thread Dump links and Thread details table > columns in Executors Tab > --- > > Key: SPARK-29476 > URL: https://issues.apache.org/jira/browse/SPARK-29476 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
jobit mathew created SPARK-29476: Summary: Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab Key: SPARK-29476 URL: https://issues.apache.org/jira/browse/SPARK-29476 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: jobit mathew -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29475) Executors aren't marked as lost on OutOfMemoryError in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikita Gorbachevski updated SPARK-29475: Summary: Executors aren't marked as lost on OutOfMemoryError in YARN mode (was: Executors aren't marked as dead on OutOfMemoryError in YARN mode) > Executors aren't marked as lost on OutOfMemoryError in YARN mode > > > Key: SPARK-29475 > URL: https://issues.apache.org/jira/browse/SPARK-29475 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Nikita Gorbachevski >Priority: Major > > My spark-streaming application runs in yarn cluster mode with > ``spark.streaming.concurrentJobs`` set to 50. Once i observed that lots of > batches were scheduled and application did not make any progress. > Thread dump showed that all the streaming threads are blocked, infinitely > waiting for result from executor on > ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. > {code:java} > "streaming-job-executor-11" - Thread t@324 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <7fd76f11> (a > scala.concurrent.impl.Promise$CompletionLatch) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) > at > org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) > at > 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} > My tasks are short running and pretty simple, e.g. read raw data from Kafka, > normalize it and write back to Kafka. > After logs analysis i noticed that there were lots of ``RpcTimeoutException`` > on both executor > {code:java} > driver-heartbeater WARN executor.Executor: Issue communicating with driver in > heartbeater > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > and driver sides during context clearing. > {code:java} > WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive > any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout > WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive > any reply in 120 seconds. This timeout is controlled by > spark.rpc.askTimeoutorg.apache.spark.rpc.RpcTimeoutException: Cannot receive > any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is > controlled by spark.rpc.askTimeout at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Also the only error on executors was > {code:java} > SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL > TERM > {code} > no exceptions at all. > After digging into YARN logs i noticed that executors were killed by AM > because of high memory usage. > Also there were no any logs on driver side saying that executors were lost. 
> Thus it seems that the driver wasn't notified about this. > Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend > which logs such a message. The ``exitExecutor`` method looks similar but its > message should look different. > However I believe that the driver is notified that an executor is lost via an async rpc > call, but if the executor encounters issues with rpc because of high GC pressure > this message won't be sent. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h.
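The reporter's hypothesis — the "executor lost" notification is itself an async RPC message, so when the RPC layer is degraded by GC pressure the message is dropped and the driver's view is never updated — can be illustrated with a small stand-alone sketch. Everything here (`Driver`, `RpcChannel`, the method names) is hypothetical plain Python for illustration, not Spark's actual classes:

```python
# Hypothetical sketch (not Spark internals): if the "executor lost"
# notification travels over the same unreliable RPC channel, the driver's
# registry of live executors never gets updated.

class Driver:
    def __init__(self):
        self.live_executors = {"exec-1", "exec-2"}

    def on_executor_lost(self, executor_id):
        self.live_executors.discard(executor_id)


class RpcChannel:
    """Stand-in for the async RPC link; drops messages when unhealthy."""

    def __init__(self, driver, healthy=True):
        self.driver = driver
        self.healthy = healthy

    def send_executor_lost(self, executor_id):
        if not self.healthy:
            return False  # message silently dropped, as hypothesized in the report
        self.driver.on_executor_lost(executor_id)
        return True


driver = Driver()

# Healthy channel: the driver learns about the loss.
RpcChannel(driver, healthy=True).send_executor_lost("exec-1")

# Channel degraded by GC pressure: the notification is dropped and the
# driver still believes exec-2 is alive, matching the observed hang.
RpcChannel(driver, healthy=False).send_executor_lost("exec-2")

print(sorted(driver.live_executors))  # exec-2 is still listed as live
```

This matches the symptom in the logs: the executor receives SIGTERM from the AM, but the driver side shows no "executor lost" message and keeps waiting on `awaitReady` forever.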
[jira] [Updated] (SPARK-29475) Executors aren't marked as dead on OutOfMemoryError in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikita Gorbachevski updated SPARK-29475: Description: My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. Once I observed that lots of batches were scheduled and the application did not make any progress. A thread dump showed that all the streaming threads are blocked, infinitely waiting for a result from an executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} My tasks are short-running and pretty simple, e.g. read raw data from Kafka, normalize it and write back to Kafka. 
After analyzing the logs, I noticed that there were lots of ``RpcTimeoutException`` on both the executor {code:java} driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} and driver sides during context clearing. {code:java} WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} Also the only error on the executors was {code:java} SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM {code} no exceptions at all. After digging into the YARN logs I noticed that the executors were killed by the AM because of high memory usage. Also there were no logs on the driver side saying that executors were lost. Thus it seems that the driver wasn't notified about this. Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar but its message should look different. However I believe that the driver is notified that an executor is lost via an async rpc call, but if the executor encounters issues with rpc because of high GC pressure this message won't be sent. was: My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. 
Once i observed that lots of batches were scheduled and application did not make any progress. Thread dump showed that all the streaming threads are blocked, infinitely waiting for result from executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
[jira] [Created] (SPARK-29475) Executors aren't marked as dead on OutOfMemoryError in YARN mode
Nikita Gorbachevski created SPARK-29475: --- Summary: Executors aren't marked as dead on OutOfMemoryError in YARN mode Key: SPARK-29475 URL: https://issues.apache.org/jira/browse/SPARK-29475 Project: Spark Issue Type: Bug Components: DStreams, Spark Core, YARN Affects Versions: 2.3.0 Reporter: Nikita Gorbachevski My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. Once I observed that lots of batches were scheduled and the application did not make any progress. A thread dump showed that all the streaming threads are blocked, infinitely waiting for a result from an executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} My tasks are short-running 
and pretty simple, e.g. read raw data from Kafka, normalize it and write back to Kafka. After analyzing the logs, I noticed that there were lots of ``RpcTimeoutException`` on both the executor {code:java} driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} and driver sides during context clearing. {code:java} WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} Also the only error on the executors was {code:java} SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM {code} no exceptions at all. After digging into the YARN logs I noticed that the executors were killed by the AM because of high memory usage. Also there were no logs on the driver side saying that executors were lost. Thus it seems that the driver wasn't notified about this. Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar but its message should look different. However I believe that the driver is notified that an executor is lost via an async rpc call, but if the executor encounters issues with rpc because of high GC pressure this message won't be sent. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29474) CLI support for Spark-on-Docker-on-Yarn
Adam Antal created SPARK-29474: -- Summary: CLI support for Spark-on-Docker-on-Yarn Key: SPARK-29474 URL: https://issues.apache.org/jira/browse/SPARK-29474 Project: Spark Issue Type: Improvement Components: Spark Shell, YARN Affects Versions: 2.4.4 Reporter: Adam Antal The Docker-on-Yarn feature has been stable in Hadoop for a while now. One can run Spark on Docker using the Docker-on-Yarn feature by providing runtime environment variables to the Spark AM and executor containers, similar to this: {noformat} --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" {noformat} This is not very user-friendly. I suggest adding CLI options to specify: - whether a docker image should be used ({{--docker}}) - which docker image should be used ({{--docker-image}}) - which docker mounts should be used ({{--docker-mounts}}) for the AM and executor containers separately. Let's discuss! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
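The proposed flags would essentially be sugar for the property list quoted above. A plain-Python sketch of that translation — the flag semantics and the helper `docker_flags_to_conf` are assumptions based on this proposal, not anything in Spark's actual CLI:

```python
# Sketch (hypothetical helper, not part of spark-submit): expand the proposed
# --docker-image / --docker-mounts flags into the six YARN_CONTAINER_RUNTIME_*
# properties shown in the issue, for both the AM and the executors.

def docker_flags_to_conf(image, mounts=None):
    conf = {}
    for prefix in ("spark.yarn.appMasterEnv.", "spark.executorEnv."):
        conf[prefix + "YARN_CONTAINER_RUNTIME_TYPE"] = "docker"
        conf[prefix + "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE"] = image
        if mounts:
            conf[prefix + "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS"] = ",".join(mounts)
    return conf


pairs = docker_flags_to_conf("repo/image:tag",
                             ["/etc/passwd:/etc/passwd:ro",
                              "/etc/hadoop:/etc/hadoop:ro"])
# Render back into the verbose --conf form a user writes today.
for key, value in sorted(pairs.items()):
    print(f"--conf {key}={value}")
```

Since the proposal asks for the AM and executor containers to be configurable separately, the real flags would presumably take per-role variants; the sketch only shows the common case where both sides use the same image and mounts.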
[jira] [Comment Edited] (SPARK-29428) Can't persist/set None-valued param
[ https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951747#comment-16951747 ] Borys Biletskyy edited comment on SPARK-29428 at 10/15/19 9:12 AM: --- [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used the isSet(...) method to understand if the parameter was set, until I found out that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. Having None params doesn't seem to be a good idea to me either, especially when there are isSet(...) and isDefined(...) methods. was (Author: borys.biletskyy): [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used isSet(...) method to understand if the parameter was set, until I found out that that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. 
> Can't persist/set None-valued param > > > Key: SPARK-29428 > URL: https://issues.apache.org/jira/browse/SPARK-29428 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > {code:java} > import pytest > from pyspark import keyword_only > from pyspark.ml import Model > from pyspark.sql import DataFrame > from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable > from pyspark.ml.param.shared import HasInputCol > from pyspark.sql.functions import * > class NoneParamTester(Model, > HasInputCol, > DefaultParamsReadable, > DefaultParamsWritable > ): > @keyword_only > def __init__(self, inputCol: str = None): > super(NoneParamTester, self).__init__() > kwargs = self._input_kwargs > self.setParams(**kwargs) > @keyword_only > def setParams(self, inputCol: str = None): > kwargs = self._input_kwargs > self._set(**kwargs) > return self > def _transform(self, data: DataFrame) -> DataFrame: > return data > class TestNoneParam(object): > def test_persist_none(self, spark, temp_dir): > path = temp_dir + '/test_model' > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.write().overwrite().save(path) > NoneParamTester.load(path) # TypeError: Could not convert 'NoneType'> to string type > def test_set_none(self, spark): > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.set(model.inputCol, None) # TypeError: Could not convert > to string type > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29428) Can't persist/set None-valued param
[ https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951747#comment-16951747 ] Borys Biletskyy commented on SPARK-29428: - [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used the isSet(...) method to understand if the parameter was set, until I found out that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. > Can't persist/set None-valued param > > > Key: SPARK-29428 > URL: https://issues.apache.org/jira/browse/SPARK-29428 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > {code:java} > import pytest > from pyspark import keyword_only > from pyspark.ml import Model > from pyspark.sql import DataFrame > from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable > from pyspark.ml.param.shared import HasInputCol > from pyspark.sql.functions import * > class NoneParamTester(Model, > HasInputCol, > DefaultParamsReadable, > DefaultParamsWritable > ): > @keyword_only > def __init__(self, inputCol: str = None): > super(NoneParamTester, self).__init__() > kwargs = self._input_kwargs > self.setParams(**kwargs) > @keyword_only > def setParams(self, inputCol: str = None): > kwargs = self._input_kwargs > self._set(**kwargs) > return self > def _transform(self, data: DataFrame) -> DataFrame: > return data > class TestNoneParam(object): > def test_persist_none(self, spark, temp_dir): > path = temp_dir + '/test_model' > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.write().overwrite().save(path) > NoneParamTester.load(path) # 
TypeError: Could not convert 'NoneType'> to string type > def test_set_none(self, spark): > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.set(model.inputCol, None) # TypeError: Could not convert > to string type > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
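Until the persistence behaviour is fixed, one way to keep isSet(...) meaningful is to never record None as a param value in the first place: skip the set call entirely, so the param simply stays unset. A minimal plain-Python sketch of that guard — the `Params` class and `set_if_not_none` helper here are hypothetical stand-ins, not the pyspark.ml API:

```python
# Workaround sketch (no pyspark dependency): rather than storing None to mean
# "explicitly unset", skip the set call so is_set() stays False and no None
# value ever reaches a string type converter.

class Params:
    def __init__(self):
        self._param_map = {}

    def set_if_not_none(self, name, value):
        # Only record params with real values; None is silently ignored.
        if value is not None:
            self._param_map[name] = value

    def is_set(self, name):
        return name in self._param_map


model = Params()
model.set_if_not_none("inputCol", None)    # ignored, is_set stays False
model.set_if_not_none("outputCol", "out")  # recorded normally

print(model.is_set("inputCol"), model.is_set("outputCol"))
```

With this convention, "param was never given" and "param was given as None" collapse into the same unset state, which is exactly the distinction the reporter was trying to preserve via None values, so it trades one limitation for another.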
[jira] [Comment Edited] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951672#comment-16951672 ] Enrico Minack edited comment on SPARK-18748 at 10/15/19 7:16 AM: - I think the behaviour of {{yourUdf.asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. was (Author: enricomi): I think the behaviour of {{asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. 
> UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println, > which causes very poor performance for our real use case. > This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we lose some of the optimizations like predicate > pushdown, etc., and it's very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951672#comment-16951672 ] Enrico Minack commented on SPARK-18748: --- I think the behaviour of {{asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. > UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println. > that causes very poor performance for our real use case. 
> This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we lose some of the optimizations like predicate > pushdown, etc., and it's very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
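The cost of the duplicated evaluation, and what collapsing it to a single call per row buys, can be shown without Spark at all. In this plain-Python sketch a call counter stands in for the expensive UDF, and `functools.lru_cache` stands in for the optimizer treating the expression as evaluate-once (which is the practical effect of marking the udf non-deterministic, as discussed above):

```python
# Illustration only: three references to the same expression either trigger
# three evaluations (the reported behaviour) or one (the desired behaviour).

import functools

calls = {"n": 0}

def very_expensive_calc(s):
    calls["n"] += 1  # stands in for the costly work behind "blahblah1"
    return "nothing"

# Naive plan: column c is referenced three times, so the UDF runs three times.
naive = [very_expensive_calc("a") for _ in range(3)]
naive_calls = calls["n"]

calls["n"] = 0
cached_calc = functools.lru_cache(maxsize=None)(very_expensive_calc)

# Collapsed plan: three references, but only one real evaluation.
collapsed = [cached_calc("a") for _ in range(3)]

print(naive_calls, calls["n"])  # 3 evaluations vs 1
```

The caching here is per-argument, like the cache() workaround in the issue; Spark's own collapsing of a non-deterministic udf is per plan node rather than a memo table, but the effect on evaluation count is the same.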