[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952545#comment-16952545 ]

Sungpeo Kook commented on SPARK-29354:
--------------------------------------

[~angerszhuuu] I'm talking about the 'without hadoop' package type, e.g. spark-2.4.4-bin-without-hadoop.tgz.

> Spark has direct dependency on jline, but binaries for 'without hadoop'
> don't have a jline jar file.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29354
>                 URL: https://issues.apache.org/jira/browse/SPARK-29354
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.3.4, 2.4.4
>         Environment: From spark 2.3.x, spark 2.4.x
>            Reporter: Sungpeo Kook
>            Priority: Minor
>
> Spark has a direct dependency on jline, declared in the root pom.xml,
> but the binaries for 'without hadoop' don't include a jline jar file.
>
> spark 2.2.x has the jline jar.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
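For context, the direct dependency the report refers to is declared roughly like the following fragment in Spark's root pom.xml. This is an illustrative sketch, not a quote of the actual pom; the version is managed there and is omitted here rather than guessed.

```xml
<!-- Illustrative sketch of the jline declaration; the real root pom
     manages the version and scope, which may differ. -->
<dependency>
  <groupId>jline</groupId>
  <artifactId>jline</artifactId>
  <!-- version managed in the root pom's dependencyManagement section -->
</dependency>
```

The issue is that the declaration exists, but the 'without hadoop' binary distribution's jars/ directory does not ship the corresponding jline jar.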
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952542#comment-16952542 ]

Thomas Graves commented on SPARK-29465:
---------------------------------------

Thanks for copying me; yes, I agree this would be an improvement. So my understanding is you want to restrict to a certain port range, or just a very specific port? A specific port doesn't really make sense on YARN, where you have multiple users.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-29465
>                 URL: https://issues.apache.org/jira/browse/SPARK-29465
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Submit, YARN
>    Affects Versions: 3.0.0
>            Reporter: Vishwas Nalka
>            Priority: Major
>
> I'm trying to restrict the ports used by a spark app launched in yarn
> cluster mode. All ports (viz. driver, executor, blockmanager) can be
> specified using the respective properties except the UI port. The spark
> app is launched using Java code, and setting the property spark.ui.port
> in SparkConf doesn't seem to help. Even setting a JVM option
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, the property spark.ui.port is overridden
> and the JVM property '-Dspark.ui.port=0' is set, even though it is never
> set to 0.
>
> (Run in Spark 1.6.2) From the logs:
>
> command: LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900'
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903'
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20'
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'
>
> 19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 35167.
> 19/10/14 16:39:59 INFO SparkUI: Started SparkUI at http://10.65.170.98:35167
>
> I even tried using a spark-submit command with --conf spark.ui.port; this
> does not spawn the UI on the required port either.
>
> (Run in Spark 2.4.4)
>
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10
>
> From the logs:
>
> 19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at
> http://invrh74ace005.informatica.com:46622
>
> command: {{JAVA_HOME}}/bin/java -server -Xmx2048m
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=12340'
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p'
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340
> --executor-id  --hostname  --cores 1 --app-id
> application_1570992022035_0089 --user-class-path
> file:$PWD/__app__.jar 1>/stdout 2>/stderr
>
> Looks like the application master overrides this and sets a JVM property
> before launch, resulting in a random UI port even though spark.ui.port is
> set by the user.
>
> In these links:
> 1. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (line 214)
> 2. https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (line 75)
> I can see that the run() method in the above files sets a system property
> (UI_PORT and spark.ui.port, respectively).
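The override the reporter describes can be illustrated with a small sketch. The resolution order below is an assumption inferred from the logs above (a JVM system property on the launch command wins over the value in SparkConf), not a quote of Spark's actual code; `resolveUiPort` is a hypothetical helper.

```scala
object UiPortSketch {
  // Hypothetical resolution order inferred from the logs: a system
  // property set on the JVM command line takes precedence over the
  // value the user put in SparkConf.
  def resolveUiPort(sparkConfValue: Option[String]): Int =
    sys.props.get("spark.ui.port")   // e.g. "0" forced by the ApplicationMaster
      .orElse(sparkConfValue)        // the user's --conf spark.ui.port=12345
      .getOrElse("4040")             // Spark's default UI port
      .toInt
}

// The AM launch command sets -Dspark.ui.port=0, so the user's 12345 is
// masked and the UI binds to an ephemeral port chosen by the OS.
sys.props("spark.ui.port") = "0"
println(UiPortSketch.resolveUiPort(Some("12345")))  // 0
```

Under this reading, the fix would be for the ApplicationMaster not to force the property to 0 when the user has set one (or to honor a port range), which is what the improvement request asks for.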
[jira] [Assigned] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reassigned SPARK-29349:
-----------------------------------

    Assignee: Juliusz Sompolski

> Support FETCH_PRIOR in Thriftserver query results fetching
> ----------------------------------------------------------
>
>                 Key: SPARK-29349
>                 URL: https://issues.apache.org/jira/browse/SPARK-29349
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Juliusz Sompolski
>            Assignee: Juliusz Sompolski
>            Priority: Major
>
> Support FETCH_PRIOR fetching in Thriftserver.
[jira] [Resolved] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-29349.
---------------------------------

    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26014
[https://github.com/apache/spark/pull/26014]
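For readers unfamiliar with the feature: FETCH_PRIOR is a Thrift fetch orientation that lets a client step backwards through rows it has already fetched, as opposed to the default forward-only FETCH_NEXT. A minimal sketch of a cursor supporting both orientations over a buffered result set follows; the class and method names are hypothetical and are not the Thriftserver's actual implementation.

```scala
// Hypothetical buffered cursor; the real Thriftserver code differs,
// but the two fetch orientations behave along these lines.
class BufferedCursor[A](rows: IndexedSeq[A]) {
  private var pos = 0  // index of the next row to return going forward

  // FETCH_NEXT: return up to n rows ahead of the cursor, advancing it.
  def fetchNext(n: Int): IndexedSeq[A] = {
    val batch = rows.slice(pos, pos + n)
    pos += batch.length
    batch
  }

  // FETCH_PRIOR: return up to n rows behind the cursor, rewinding it.
  def fetchPrior(n: Int): IndexedSeq[A] = {
    val start = math.max(0, pos - n)
    val batch = rows.slice(start, pos)
    pos = start
    batch
  }
}

val c = new BufferedCursor(Vector(1, 2, 3, 4, 5))
println(c.fetchNext(3))   // Vector(1, 2, 3)
println(c.fetchPrior(2))  // Vector(2, 3): re-fetch the last two rows
```

Supporting this in the Thriftserver requires keeping fetched batches around so earlier rows can be served again, which is the substance of the change.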
[jira] [Created] (SPARK-29487) Ability to run Spark Kubernetes other than from /opt/spark
Benjamin Miao CAI created SPARK-29487:
-----------------------------------------

             Summary: Ability to run Spark Kubernetes other than from /opt/spark
                 Key: SPARK-29487
                 URL: https://issues.apache.org/jira/browse/SPARK-29487
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes, Spark Submit
    Affects Versions: 2.4.4
            Reporter: Benjamin Miao CAI

In the spark kubernetes Dockerfile, the spark binaries are copied to /opt/spark.

If we try to create our own Dockerfile without using /opt/spark, then the image will not run.

After looking at the source code, it seems that in various places the path is hard-coded to /opt/spark.

Example, in Constants.scala:

  // Spark app configs for containers
  val SPARK_CONF_VOLUME = "spark-conf-volume"
  val SPARK_CONF_DIR_INTERNAL = "/opt/spark/conf"

Is it possible to make this configurable so we can put spark somewhere other than /opt/?
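To illustrate the constraint the report describes: a custom image currently has to place the distribution at /opt/spark, because paths such as SPARK_CONF_DIR_INTERNAL are compile-time constants. A hypothetical Dockerfile sketch (the base image, source directory name, and entrypoint path are assumptions for illustration, not the project's official Dockerfile):

```dockerfile
# Hypothetical custom image: the binaries must land at /opt/spark because
# Constants.scala hard-codes paths like /opt/spark/conf.
FROM eclipse-temurin:8-jre
COPY spark-dist /opt/spark   # copying to any other prefix breaks the image
ENV SPARK_HOME=/opt/spark
ENTRYPOINT ["/opt/spark/kubernetes/dockerfiles/spark/entrypoint.sh"]
```

Making the prefix configurable (e.g. derived from SPARK_HOME rather than a constant) would let images place the distribution elsewhere.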
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952537#comment-16952537 ]

Sandeep Katta commented on SPARK-29465:
---------------------------------------

[~dongjoon] thank you for the review. I will raise a PR soon.
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952535#comment-16952535 ]

Vishwas Nalka edited comment on SPARK-29465 at 10/16/19 6:04 AM:
-----------------------------------------------------------------

Reopened the issue as an Improvement.

was (Author: vishwasn):
Reopen the issue as Improvement.
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952535#comment-16952535 ]

Vishwas Nalka commented on SPARK-29465:
---------------------------------------

Reopen the issue as Improvement.
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka updated SPARK-29465:
----------------------------------

    Affects Version/s: (was: 2.4.4)
                       3.0.0
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka updated SPARK-29465:
----------------------------------

    Issue Type: Improvement  (was: Bug)
[jira] [Reopened] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishwas Nalka reopened SPARK-29465:
-----------------------------------
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952526#comment-16952526 ] Dongjoon Hyun commented on SPARK-29465: --- You may reopen this issue as an `Improvement` JIRA with the affected version `3.0.0`, but this is not a bug definitely. > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.4.4 >Reporter: Vishwas Nalka >Priority: Major > > I'm trying to restrict the ports used by spark app which is launched in yarn > cluster mode. All ports (viz. driver, executor, blockmanager) could be > specified using the respective properties except the ui port. The spark app > is launched using JAVA code and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI is required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
[jira] [Created] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds
Liu, Linhong created SPARK-29486: Summary: CalendarInterval should have 3 fields: months, days and microseconds Key: SPARK-29486 URL: https://issues.apache.org/jira/browse/SPARK-29486 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Liu, Linhong The current CalendarInterval has 2 fields: months and microseconds. This proposal changes it to 3 fields: months, days and microseconds, because one logical day interval may span a different number of microseconds (daylight saving time).
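The daylight-saving argument can be checked directly on the JVM: adding one calendar day and adding a fixed 86,400 seconds diverge across a DST transition, which is why a `days` field cannot be folded into `microseconds`. The zone and date below are illustrative choices, not taken from the issue:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstDay {
    public static void main(String[] args) {
        // 2019-11-03 is the DST fall-back date in America/Los_Angeles,
        // so that calendar day is 25 hours long.
        ZonedDateTime start =
            ZonedDateTime.of(2019, 11, 3, 0, 0, 0, 0, ZoneId.of("America/Los_Angeles"));

        // A "days" interval follows the calendar; a "microseconds" interval
        // follows the clock. Across the transition they disagree by one hour.
        ZonedDateTime plusOneDay = start.plusDays(1);         // 2019-11-04T00:00
        ZonedDateTime plusFixed  = start.plusSeconds(86_400); // 2019-11-03T23:00

        System.out.println(plusOneDay.toLocalDateTime());
        System.out.println(plusFixed.toLocalDateTime());
    }
}
```

Here 25 wall-clock hours elapse over the "one day" interval, so a 2-field (months, microseconds) representation cannot express a one-day interval faithfully.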
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952516#comment-16952516 ] Vishwas Nalka edited comment on SPARK-29465 at 10/16/19 5:35 AM: - [~dongjoon], The problem is when the yarn cluster is running on a machine where not all ports can be opened. The requirement is to restrict the ports used by a Spark job launched in yarn mode. I was able to set all other ports, such as _"spark.driver.port"_ and _"spark.blockManager.port"_, except the UI port. {code:java} // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0"){code} Can't the above code be modified to check whether the UI port is already set by the user, and fall back to a random port (as the comment describes) only when it is not? Do let me know your suggestion. Thanks!
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952516#comment-16952516 ] Vishwas Nalka commented on SPARK-29465: --- [~dongjoon], The problem is when the yarn cluster is running on a machine where not all ports can be opened. The requirement is to restrict the ports used by a Spark job launched in yarn mode. I was able to set all other ports, such as _"spark.driver.port"_ and _"spark.blockManager.port"_, except the UI port. // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0") Can't the above code be modified to check whether the UI port is already set by the user, and fall back to a random port (as the comment describes) only when it is not? Do let me know your suggestion. Thanks!
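The guard the commenter asks for could look roughly like the sketch below. This is a hypothetical standalone rendering of the idea (the helper name `setUiPortIfUnset` is invented), not Spark's actual `ApplicationMaster` code:

```java
public class UiPortGuard {
    // Hedged sketch of the proposal: force the ephemeral web UI port only
    // when the user has not chosen one. This is NOT Spark's real code; the
    // class and method names here are assumptions for illustration.
    static void setUiPortIfUnset() {
        if (System.getProperty("spark.ui.port") == null) {
            // Ephemeral port 0: avoids conflicts with other Spark apps
            // running on the same box (the original SPARK-3627 behavior).
            System.setProperty("spark.ui.port", "0");
        }
    }

    public static void main(String[] args) {
        System.setProperty("spark.ui.port", "12345"); // simulate --conf spark.ui.port=12345
        setUiPortIfUnset();
        System.out.println(System.getProperty("spark.ui.port")); // prints "12345"
    }
}
```

With this shape, users who never set `spark.ui.port` keep the current conflict-free behavior, while users on locked-down machines can pin the port at their own risk.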
[jira] [Assigned] (SPARK-27259) Allow setting -1 as split size for InputFileBlock
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27259: - Assignee: Praneet Sharma > Allow setting -1 as split size for InputFileBlock > - > > Key: SPARK-27259 > URL: https://issues.apache.org/jira/browse/SPARK-27259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Simon poortman >Assignee: Praneet Sharma >Priority: Major > > > Since Spark 2.2.x, when a Spark job processes compressed HDFS files with a custom input file format, the job fails with the error "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative". A custom input file format returns a byte length of -1 for compressed file formats because compressed HDFS files are non-splittable, so a compressed input split is reported with offset 0 and length -1; Spark should treat a length of -1 as a valid split for compressed file formats. > > Earlier versions of Spark did not have this validation; it was introduced in Spark 2.2.x in the class InputFileBlockHolder, so Spark 2.2.x and later should again accept a byte length of -1 as a valid length for input splits. 
> > +Below is the stack trace.+ > Caused by: java.lang.IllegalArgumentException: requirement failed: length > (-1) cannot be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70) > at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > +Below is the code snippet which caused this issue.+ > require(length >= 0, s"length ($length) cannot be negative") // This validation caused the issue. > > {code:java} > // code placeholder > org.apache.spark.rdd.InputFileBlockHolder - spark-core > > def set(filePath: String, startOffset: Long, length: Long): Unit = { > require(filePath != null, "filePath cannot be null") > require(startOffset >= 0, s"startOffset ($startOffset) cannot be > negative") > require(length >= 0, s"length ($length) cannot be negative") > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), > startOffset, length)) > } > {code} > > +Steps to reproduce the issue.+ > Please refer to the below code to reproduce the issue. 
> {code:java} > // code placeholder > import org.apache.hadoop.mapred.JobConf > val hadoopConf = new JobConf() > import org.apache.hadoop.mapred.FileInputFormat > import org.apache.hadoop.fs.Path > FileInputFormat.setInputPaths(hadoopConf, new > Path("/output656/part-r-0.gz")) > val records = > sc.hadoopRDD(hadoopConf,classOf[com.platform.custom.storagehandler.INFAInputFormat], > classOf[org.apache.hadoop.io.LongWritable], > classOf[org.apache.hadoop.io.Writable]) > records.count() > {code} >
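One way to read the requested change is to treat -1 as a sentinel for "unknown length" while still rejecting other negative values. The sketch below is an assumed standalone rendering of that relaxed validation (the class and method names are invented), not the actual Spark patch:

```java
public class BlockLengthCheck {
    // Hypothetical relaxed validation: -1 means "unknown length" for
    // non-splittable (e.g. compressed) inputs and is accepted; anything
    // below -1 is still rejected, as are null paths and negative offsets.
    static void validate(String filePath, long startOffset, long length) {
        if (filePath == null) {
            throw new IllegalArgumentException("filePath cannot be null");
        }
        if (startOffset < 0) {
            throw new IllegalArgumentException(
                "startOffset (" + startOffset + ") cannot be negative");
        }
        if (length < -1) {
            throw new IllegalArgumentException(
                "length (" + length + ") must be >= -1");
        }
    }

    public static void main(String[] args) {
        // A compressed, non-splittable file reported as offset 0, length -1
        // now passes validation instead of throwing.
        validate("/output656/part-r-0.gz", 0, -1);
        System.out.println("length -1 accepted");
    }
}
```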
[jira] [Updated] (SPARK-27259) Allow setting -1 as split size for InputFileBlock
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27259: -- Summary: Allow setting -1 as split size for InputFileBlock (was: Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2.X)
[jira] [Resolved] (SPARK-29469) Avoid retries by RetryingBlockFetcher when ExternalBlockStoreClient is closed
[ https://issues.apache.org/jira/browse/SPARK-29469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29469. - Fix Version/s: 3.0.0 Resolution: Fixed > Avoid retries by RetryingBlockFetcher when ExternalBlockStoreClient is closed > - > > Key: SPARK-29469 > URL: https://issues.apache.org/jira/browse/SPARK-29469 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.0.0 > > > The following NPE was found in a job log: > 2019-10-14 20:06:16 ERROR RetryingBlockFetcher:143 - Exception while > beginning fetch of 2 outstanding blocks (after 3 retries) > java.lang.NullPointerException > at > org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > It happened after the BlockManager and ExternalBlockStoreClient were closed due to previous errors. In these cases, RetryingBlockFetcher does not need to retry. The NPE is harmless for job execution, but it is misleading when reading the log, especially for end users.
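The shape of the fix can be sketched as a closed-flag check consulted before each retry. Everything below (class and method names) is an assumption for illustration, not Spark's `RetryingBlockFetcher` API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RetryGuard {
    // Hedged sketch of the idea behind the fix: once the client is closed,
    // skip further retries instead of dereferencing a torn-down transport
    // and logging a misleading NullPointerException.
    private final AtomicBoolean closed = new AtomicBoolean(false);
    int attempts = 0;

    // Returns false when the fetch is skipped because the client is closed.
    boolean tryFetch() {
        if (closed.get()) {
            return false; // client already shut down: no point retrying
        }
        attempts++;       // a real implementation would initiate the fetch here
        return true;
    }

    void close() {
        closed.set(true);
    }
}
```

The flag is an `AtomicBoolean` because close() and the retry executor run on different threads; a plain `boolean` would not guarantee visibility of the shutdown.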
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952509#comment-16952509 ] Rakesh Raushan commented on SPARK-29477: I will raise a PR for this soon. > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Active Batches and Completed Batches tables can be revisited to put proper tooltips on the Batch Time and Record columns
[jira] [Issue Comment Deleted] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-29477: Comment: was deleted (was: i will check this issue)
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952507#comment-16952507 ] Dongjoon Hyun edited comment on SPARK-29465 at 10/16/19 4:50 AM: - Thank you for filing a JIRA, but I'll close this as `Not A Problem`, [~vishwasn]. In a YARN cluster environment, it's not a good idea to specify the port; it's fundamentally an anti-pattern.
[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952505#comment-16952505 ] Dongjoon Hyun edited comment on SPARK-29465 at 10/16/19 4:48 AM: - Sorry, but this is trying to revert [SPARK-3627|https://github.com/apache/spark/commit/70e824f750aa8ed446eec104ba158b0503ba58a9]. Apache Spark is designed to ignore it intentionally. {code} // Set the web ui port to be ephemeral for yarn so we don't conflict with // other spark processes running on the same box System.setProperty("spark.ui.port", "0") {code} cc [~tgraves]
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even tried using a *spark-submit command with --conf spark.ui.port* does > spawn UI in required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs::_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' 'Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master override this and set a JVM property before > launch resulting in random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the _*run()*_ method in the above files sets the system properties > _*UI_PORT*_ and _*spark.ui.port*_, respectively. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29465. --- Resolution: Not A Problem Thank you for filing a JIRA, but I'll close this as `Not A Problem`, [~vishwasn].
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952505#comment-16952505 ] Dongjoon Hyun commented on SPARK-29465: --- Sorry, but this is trying to revert [SPARK-3627|https://github.com/apache/spark/commit/70e824f750aa8ed446eec104ba158b0503ba58a9]. Apache Spark is designed to ignore it intentionally. cc [~tgraves]
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29465: -- Affects Version/s: (was: 1.6.2)
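To make the precedence discussed in this thread concrete, here is a toy, pure-Python model (illustrative names only; this is not Spark's actual configuration-resolution code): the YARN ApplicationMaster writes `spark.ui.port=0` (an ephemeral port) into the JVM system properties after the user's configuration is read, and for this setting the system property wins, so the user's value never reaches the UI server.

```python
user_conf = {"spark.ui.port": "12345"}   # what the user asked for
system_props = {}                        # JVM system properties

def application_master_prepare(props):
    # mirrors System.setProperty("spark.ui.port", "0") in ApplicationMaster:
    # "0" means "pick an ephemeral port" so concurrent apps don't collide
    props["spark.ui.port"] = "0"

def effective_ui_port(conf, props):
    # the system property takes precedence over the SparkConf entry
    return props.get("spark.ui.port", conf.get("spark.ui.port", "4040"))

application_master_prepare(system_props)
assert effective_ui_port(user_conf, system_props) == "0"  # user value ignored
```

This is why setting `spark.ui.port` in SparkConf or via `-Dspark.ui.port` has no effect in yarn-cluster mode: the override happens later in the launch sequence.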
[jira] [Commented] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952495#comment-16952495 ] Dongjoon Hyun commented on SPARK-29423: --- Thank you all. This is merged to master as an Improvement. > leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus > --- > > Key: SPARK-29423 > URL: https://issues.apache.org/jira/browse/SPARK-29423 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1, 2.4.3, 2.4.4 >Reporter: pin_zhang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > 1. Start the server with start-thriftserver.sh > 2. A JDBC client connects and disconnects to hiveserver2 > for (int i = 0; i < 1; i++) { >Connection conn = > DriverManager.getConnection("jdbc:hive2://localhost:1", "test", ""); >conn.close(); > } > 3. Instances of > org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep > increasing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29423: -- Affects Version/s: 2.4.4
[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29423: -- Issue Type: Improvement (was: Bug)
[jira] [Resolved] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29423. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26089 [https://github.com/apache/spark/pull/26089]
[jira] [Assigned] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29423: - Assignee: Yuming Wang
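The repro above boils down to a classic listener-accumulation leak: each short-lived connection registers a listener bus on a shared, long-lived bus and never removes it. A minimal pure-Python model of the pattern (illustrative classes; this is not Spark's actual `StreamingQueryListenerBus` code):

```python
class SharedBus:
    """Long-lived bus shared across all sessions (like the global LiveListenerBus)."""
    def __init__(self):
        self.listeners = []
    def add(self, listener):
        self.listeners.append(listener)
    def remove(self, listener):
        self.listeners.remove(listener)

class Session:
    """Short-lived session that registers a listener on the shared bus."""
    def __init__(self, bus):
        self.bus = bus
        self.listener = object()
        bus.add(self.listener)          # registered on open
    def close_leaky(self):
        pass                            # bug: listener never unregistered
    def close_fixed(self):
        self.bus.remove(self.listener)  # fix: unregister on close

bus = SharedBus()
for _ in range(100):
    Session(bus).close_leaky()
assert len(bus.listeners) == 100        # leak: grows with every connect/disconnect

bus2 = SharedBus()
for _ in range(100):
    Session(bus2).close_fixed()
assert len(bus2.listeners) == 0         # fixed: nothing accumulates
```

In the real bug, the JDBC connect/disconnect loop against the Thrift server plays the role of the session loop here.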
[jira] [Updated] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29477: - Description: The Active Batches and Completed Batches tables can be reviewed to add proper tooltips for the Batch Time and Record columns Summary: Improve tooltip information for Streaming Tab (was: Improve tooltip information for Streaming Statistics under Streaming Tab) > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Active Batches and Completed Batches tables can be reviewed to add proper tooltips > for the Batch Time and Record columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952461#comment-16952461 ] Kangtian commented on SPARK-29267: -- [~hyukjin.kwon] The screenshots were not committed; they show my solution for `countApprox`. "Also, do you mean {{timeout}} does not affect the finish time of {{countApprox(timeout: Long, confidence: Double = 0.95)}}?" Yes, when the timeout is reached, the job does not stop; it keeps running in the background. > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kangtian >Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png, > image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png > > > {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues to run until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, > without FinalValue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
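The two behaviours being contrasted can be sketched deterministically without Spark. This is a toy model (`run_job` and its parameters are illustrative, not Spark APIs): today's `countApprox` hands back the partial value when the timeout fires but lets the job run to completion, while the request is to also cancel the job at that point.

```python
def run_job(num_partitions, timeout_after, cancel_on_timeout):
    """Process partitions one at a time; the 'timeout' fires between partitions."""
    processed = 0
    partial_at_timeout = None
    for p in range(num_partitions):
        if p == timeout_after and partial_at_timeout is None:
            partial_at_timeout = processed      # value the caller receives
            if cancel_on_timeout:
                break                           # proposed: stop the job here
        processed += 1                          # otherwise work keeps running
    return partial_at_timeout, processed

# Current behaviour per the comment: partial value returned at timeout,
# but all 10 partitions still get processed in the background.
assert run_job(10, 3, cancel_on_timeout=False) == (3, 10)

# Requested behaviour: same partial value, but the job stops immediately.
assert run_job(10, 3, cancel_on_timeout=True) == (3, 3)
```

The partial value is identical in both cases; the difference is only whether cluster resources keep being consumed after the timeout.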
[jira] [Commented] (SPARK-29485) Improve BooleanSimplification performance
[ https://issues.apache.org/jira/browse/SPARK-29485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952459#comment-16952459 ] Yuming Wang commented on SPARK-29485: - cc [~manifoldQAQ] > Improve BooleanSimplification performance > - > > Key: SPARK-29485 > URL: https://issues.apache.org/jira/browse/SPARK-29485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:scala} > val columns = (0 until 1000).map{ i => s"id as id$i"} > spark.range(1).selectExpr(columns : _*).write.saveAsTable("t1") > spark.range(1).selectExpr(columns : _*).write.saveAsTable("t2") > RuleExecutor.resetMetrics() > spark.table("t1").join(spark.table("t2"), (1 until 800).map(i => > s"id${i}")).show(false) > logWarning(RuleExecutor.dumpTimeSpent()) > {code} > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs: 20157 > Total time: 12.918977054 seconds > Rule > Effective Time / Total Time > Effective Runs / Total Runs > org.apache.spark.sql.catalyst.optimizer.BooleanSimplification > 0 / 9835799647 0 / 3 > > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 1532613008 / 1532613008 1 / 1 > > ... > {noformat} > If we disable {{BooleanSimplification}}: > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs: 20154 > Total time: 3.715814437 seconds > Rule > Effective Time / Total Time > Effective Runs / Total Runs > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 2081338100 / 2081338100 1 / 1 > > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29485) Improve BooleanSimplification performance
Yuming Wang created SPARK-29485: --- Summary: Improve BooleanSimplification performance Key: SPARK-29485 URL: https://issues.apache.org/jira/browse/SPARK-29485 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
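The roughly 9.8 seconds spent in `BooleanSimplification` on an 800-key join condition is consistent with a rule whose cost grows super-linearly in the number of conjuncts. As a hedged, pure-Python illustration of that complexity class (this is not Spark's actual optimizer code), compare deduplicating `a && a -> a` with pairwise scans versus a hash set:

```python
def simplify_quadratic(terms):
    # Pairwise-scan dedup: each conjunct is compared against every conjunct
    # kept so far, so total cost grows as O(n^2) in the conjunct count.
    out = []
    for t in terms:
        if all(t != u for u in out):
            out.append(t)
    return out

def simplify_hashed(terms):
    # Same result, but set-based membership tests make it O(n) overall.
    seen, out = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# 8000 conjuncts over 800 distinct predicates, echoing the 800-column join
terms = [f"id{i % 800}" for i in range(8000)]
assert simplify_quadratic(terms) == simplify_hashed(terms)
```

The exact hotspot inside Spark's rule may differ; the point is that on wide conjunctions, any per-term rescan of the whole expression dominates total optimizer time, matching the metrics above.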
[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
[ https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952399#comment-16952399 ] Huaxin Gao commented on SPARK-29381: [~podongfeng] Sorry, somehow I thought this was a follow-up Jira to the previous one and that I only needed to add _ to the existing XXXParams classes. I will open a follow-up Jira to add _LinearSVCParams, _LinearRegressionParams and a few others. Before I work on the last parity Jira, I will double-check all the ML packages to make sure Scala and Python are in sync. > Add 'private' _XXXParams classes for classification & regression > > > Key: SPARK-29381 > URL: https://issues.apache.org/jira/browse/SPARK-29381 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:58 PM: -- this is fixed by [https://github.com/apache/spark/pull/26118] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): this is fixed by [https://github.com/apache/spark/pull/26118|https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
> Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > 
"/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. However, this is not true: a Row orders its elements > alphabetically, but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to go (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem: the > fields are out of alphabetical order and one field is a StructType, making > schema._needSerializeAnyField==True . However, we encountered this in real use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:58 PM: -- this is fixed by [https://github.com/apache/spark/pull/26118|https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): this is fixed by [https://github.com/apache/spark/pull/26118.] As in RDD days Row is considered a tuple while all Rows coming from dataframes have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
> Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > 
"/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. However, this is not true: a Row orders its elements > alphabetically, but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem: the > fields are out of alphabetical order and one field is a StructType, making > schema._needSerializeAnyField == True. However, we encountered this in real use. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
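The misalignment described above can be illustrated without Spark at all. The sketch below is plain Python with hypothetical names: it mimics how pre-fix PySpark Rows sorted keyword arguments alphabetically while a StructType keeps fields in declaration order, so the positional zip pairs the wrong value with each field whenever the schema is not alphabetically ordered.

```python
# Mimic pre-fix PySpark behaviour: Row sorts its keyword arguments
# alphabetically, while the schema keeps declaration order.
def make_row(**kwargs):
    # Return values sorted by field name, like pyspark.sql.Row did.
    return tuple(v for _, v in sorted(kwargs.items()))

schema_fields = ["field2", "field1"]  # declaration order, not alphabetical
row = make_row(field1="Hello", field2=("world",))

# zip(schema.fields, obj, ...) pairs each schema field with a value
# positionally -- here 'field2' gets 'Hello', which is the field1 value.
pairs = dict(zip(schema_fields, row))
print(pairs)  # {'field2': 'Hello', 'field1': ('world',)}
```

Reordering `schema_fields` to `["field1", "field2"]` makes the pairing line up again, which matches the reporter's observation that reordering the schema avoids the error.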
[jira] [Updated] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-24540: - Priority: Minor (was: Major) > Support for multiple character delimiter in Spark CSV read > -- > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Ashwin K >Assignee: Jeff Evans >Priority: Minor > Fix For: 3.0.0 > > > Currently, the delimiter option used by Spark (since 2.0) to read and split CSV files/data > only supports a single-character delimiter. If we try to provide multiple > delimiters, we observe the following error message. > e.g.: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple-character delimiters, and presently > we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There is a work-around (reading the data as text and using the split > option), but in my opinion this defeats the purpose, advantage and efficiency > of a direct read from a CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
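The text-then-split work-around mentioned in the ticket can be sketched in a few lines of plain Python (the column values and the two-character delimiter ", " are illustrative); in Spark the analogous approach would be reading the file as text and splitting each line on the multi-character delimiter:

```python
import io

# A small CSV-like payload that uses the two-character delimiter ", "
raw = io.StringIO("1, alice, 10\n2, bob, 20\n")

# Read as plain text, then split each line on the full delimiter string.
rows = [line.rstrip("\n").split(", ") for line in raw]
print(rows)  # [['1', 'alice', '10'], ['2', 'bob', '20']]
```

This sidesteps the single-character restriction at the cost of losing CSV features such as quoting and schema inference, which is exactly the inefficiency the reporter objects to.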
[jira] [Assigned] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-24540: Assignee: Jeff Evans -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-24540. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26027 [https://github.com/apache/spark/pull/26027] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth edited comment on SPARK-24915 at 10/15/19 8:25 PM: -- This is fixed by [https://github.com/apache/spark/pull/26118]. As in the RDD days, Row is considered a tuple, while all Rows coming from DataFrames have column names and behave more like dicts. Maybe it's time to deprecate RDD-style Rows? was (Author: jhereth): This is fixed by [https://github.com/apache/spark/pull/26118]. It's strange that Row is considered a tuple (it also causes the tests to look a bit strange). However, changing the hierarchy seemed a bit too adventurous. > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) -
[jira] [Assigned] (SPARK-28947) Status logging occurs on every state change but not at an interval for liveness.
[ https://issues.apache.org/jira/browse/SPARK-28947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-28947: -- Assignee: Kent Yao > Status logging occurs on every state change but not at an interval for > liveness. > > > Key: SPARK-28947 > URL: https://issues.apache.org/jira/browse/SPARK-28947 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > The start method of `LoggingPodStatusWatcherImpl` should be invoked -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28947) Status logging occurs on every state change but not at an interval for liveness.
[ https://issues.apache.org/jira/browse/SPARK-28947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-28947. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25648 [https://github.com/apache/spark/pull/25648] > Status logging occurs on every state change but not at an interval for > liveness. > > > Key: SPARK-28947 > URL: https://issues.apache.org/jira/browse/SPARK-28947 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.4 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > The start method of `LoggingPodStatusWatcherImpl` should be invoked -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27558) NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
[ https://issues.apache.org/jira/browse/SPARK-27558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952224#comment-16952224 ] Josh Rosen commented on SPARK-27558: I just ran into this as well: it looks like the problem is that {{UnsafeInMemorySorter.getMemoryUsage}} does not gracefully handle the case where {{array = null}}. I suspect that adding {code:java} if (array != null) { current code } else { 0L } {code} would be a correct, sufficient fix. > NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter > causing tasks to hang > > > Key: SPARK-27558 > URL: https://issues.apache.org/jira/browse/SPARK-27558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2 >Reporter: Alessandro Bellina >Priority: Major > > We see an NPE in the UnsafeInMemorySorter.getMemoryUsage function (due to the > array we are accessing there being null). This looks to be caused by a Spark > OOM when UnsafeInMemorySorter is trying to spill. > This is likely a symptom of > https://issues.apache.org/jira/browse/SPARK-21492. The real question for this > ticket is: could we handle things more gracefully, rather than throwing an NPE? For > example: > Remove this: > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L182 > so that when this fails (storing the new array into a temporary first): > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L186 > we don't end up with a null "array". This state is causing one of our jobs to > hang infinitely (we think) due to the original allocation error. 
> Stack trace for reference > {noformat} > 2019-04-23 08:57:14,989 [Executor task launch worker for task 46729] ERROR > org.apache.spark.TaskContextImpl - Error in TaskCompletionListener > java.lang.NullPointerException > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getMemoryUsage(UnsafeInMemorySorter.java:208) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getMemoryUsage(UnsafeExternalSorter.java:249) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.updatePeakMemoryUsed(UnsafeExternalSorter.java:253) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.freeMemory(UnsafeExternalSorter.java:296) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:328) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.lambda$new$0(UnsafeExternalSorter.java:178) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:118) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:131) > at > org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:129) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:129) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117) > at org.apache.spark.scheduler.Task.run(Task.scala:119) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 2019-04-23 08:57:15,069 [Executor task launch worker for task 46729] ERROR > org.apache.spark.executor.Executor - Exception in task 102.0 in stage 28.0 > (TID 46729) > org.apache.spark.util.TaskCompletionListenerException: null > Previous exception in task: Unable to acquire 65536 bytes of memory, got 0 > org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) > > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) > > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:186) > > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:229) > > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:204) > > org.apache.spark.memory.TaskMemoryManager.allocatePage(Tas
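The guard suggested in the comment above amounts to treating a freed or never-allocated buffer as contributing zero bytes of memory usage. A minimal Python sketch of that defensive pattern (names and the 8-byte word size are illustrative, not the actual Spark code):

```python
WORD_SIZE = 8  # bytes per sorter slot, illustrative

def get_memory_usage(array):
    # After a failed spill the backing array may have been freed (None);
    # report 0 instead of raising, mirroring the proposed null guard.
    if array is None:
        return 0
    return len(array) * WORD_SIZE

print(get_memory_usage(None))        # 0
print(get_memory_usage([0] * 1024))  # 8192
```

The point of the guard is that accounting code called from a completion listener should never fail, since a secondary exception there masks the original OOM and can leave the task hung.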
[jira] [Resolved] (SPARK-29470) Update plugins to latest versions
[ https://issues.apache.org/jira/browse/SPARK-29470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29470. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26117 [https://github.com/apache/spark/pull/26117] > Update plugins to latest versions > - > > Key: SPARK-29470 > URL: https://issues.apache.org/jira/browse/SPARK-29470 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29470) Update plugins to latest versions
[ https://issues.apache.org/jira/browse/SPARK-29470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29470: - Assignee: Dongjoon Hyun > Update plugins to latest versions > - > > Key: SPARK-29470 > URL: https://issues.apache.org/jira/browse/SPARK-29470 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29484) Very Poor Performance When Reading SQL Server Tables Using Active Directory Authentication
Scott Black created SPARK-29484: --- Summary: Very Poor Performance When Reading SQL Server Tables Using Active Directory Authentication Key: SPARK-29484 URL: https://issues.apache.org/jira/browse/SPARK-29484 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Environment: Test Case jars for ms sql client adal4j 1.6.4 oauth2-oidc-sdk 6.16.2 json-smart 2.3 nimbus-jose-jwt 7.9 mssql-jdbc 7.4.0.jre8 Slow JDBC URL using AD "jdbc:sqlserver://:1433;database=;user=;password=;encrypt=true;ServerCertificate=false;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;authentication=ActiveDirectoryPassword" URL For Expected Performance Using SQL Server Account "jdbc:sqlserver://:1433;database=;user=;password=;encrypt=true;ServerCertificate=false;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;authentication=sqlPassword" Reporter: Scott Black When creating a dataframe via JDBC from MS SQL Server, performance is so bad as to be unusable when authentication is performed with Active Directory. When authentication uses a SQL Server account, performance is as expected. I created a Java app that connected to the same Azure SQL database via an Active Directory account, and performance was the same as with the SQL Server account. When connecting via Active Directory it can take 30 minutes to 1 hour to read a 250-row table, compared to 5 seconds using a SQL Server account or a console Java app connecting via AD. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29387) Support `*` and `/` operators for intervals
[ https://issues.apache.org/jira/browse/SPARK-29387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29387: --- Summary: Support `*` and `/` operators for intervals (was: Support `*` and `\` operators for intervals) > Support `*` and `/` operators for intervals > --- > > Key: SPARK-29387 > URL: https://issues.apache.org/jira/browse/SPARK-29387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Support `*` by numeric, `/` by numeric. See > [https://www.postgresql.org/docs/12/functions-datetime.html] > ||Operator||Example||Result|| > |*|900 * interval '1 second'|interval '00:15:00'| > |*|21 * interval '1 day'|interval '21 days'| > |/|interval '1 hour' / double precision '1.5'|interval '00:40:00'| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
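The semantics in the table above match ordinary duration arithmetic; Python's `datetime.timedelta` reproduces all three rows. This is an analogy for the intended behaviour, not Spark's implementation:

```python
from datetime import timedelta

# 900 * interval '1 second' = interval '00:15:00'
assert 900 * timedelta(seconds=1) == timedelta(minutes=15)

# 21 * interval '1 day' = interval '21 days'
assert 21 * timedelta(days=1) == timedelta(days=21)

# interval '1 hour' / 1.5 = interval '00:40:00'
assert timedelta(hours=1) / 1.5 == timedelta(minutes=40)
```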
[jira] [Created] (SPARK-29483) Bump Jackson to 2.10.0
Fokko Driesprong created SPARK-29483: Summary: Bump Jackson to 2.10.0 Key: SPARK-29483 URL: https://issues.apache.org/jira/browse/SPARK-29483 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: Fokko Driesprong Fix For: 3.0.0 Fixes the following CVEs: https://www.cvedetails.com/cve/CVE-2019-16942/ https://www.cvedetails.com/cve/CVE-2019-16943/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27259) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952164#comment-16952164 ] Dongjoon Hyun commented on SPARK-27259: --- Hi, [~praneetsharma]. Thank you for making a JIRA, but we cannot reproduce your problem due to the following. {code} com.platform.custom.storagehandler.INFAInputFormat {code} > Processing Compressed HDFS files with spark failing with error: > "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot > be negative" from spark 2.2.X > - > > Key: SPARK-27259 > URL: https://issues.apache.org/jira/browse/SPARK-27259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Simon poortman >Priority: Major > > > From Spark 2.2.x onward, when a Spark job processes compressed HDFS files > with a custom input file format, the job fails with the error > "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot > be negative". The custom input file format returns -1 as the number of bytes > for compressed file formats because compressed HDFS files are non-splittable, > so for compressed input formats the split has offset 0 and length -1; Spark > should accept the length value -1 as a valid split for compressed file formats. > > We observed that earlier versions of Spark don't have this validation, and > found that this validation was introduced in the class InputFileBlockHolder > in Spark 2.2.x, so Spark should accept the length value -1 as valid for > input splits from Spark 2.2.x onward as well. 
> > +Below is the stack trace.+ > Caused by: java.lang.IllegalArgumentException: requirement failed: length > (-1) cannot be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:226) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > +Below is the code snippet that caused this issue.+ > ** {color:#ff}require(length >= 0, s"length ($length) cannot be > negative"){color} // This validation caused the issue. > > {code:java} > // code placeholder > org.apache.spark.rdd.InputFileBlockHolder - spark-core > > def set(filePath: String, startOffset: Long, length: Long): Unit = { > require(filePath != null, "filePath cannot be null") > require(startOffset >= 0, s"startOffset ($startOffset) cannot be > negative") > require(length >= 0, s"length ($length) cannot be negative") > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), > startOffset, length)) > } > {code} > > +Steps to reproduce the issue.+ > Please refer to the code below to reproduce the issue. 
> {code:java} > // code placeholder > import org.apache.hadoop.mapred.JobConf > val hadoopConf = new JobConf() > import org.apache.hadoop.mapred.FileInputFormat > import org.apache.hadoop.fs.Path > FileInputFormat.setInputPaths(hadoopConf, new > Path("/output656/part-r-0.gz")) > val records = > sc.hadoopRDD(hadoopConf,classOf[com.platform.custom.storagehandler.INFAInputFormat], > classOf[org.apache.hadoop.io.LongWritable], > classOf[org.apache.hadoop.io.Writable]) > records.count() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
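The relaxation the reporter asks for, accepting -1 as an "unknown length" sentinel for non-splittable compressed inputs, can be sketched as follows. This is plain Python with hypothetical names; the real validation is the Scala `require` shown in the description:

```python
def set_input_block(file_path, start_offset, length):
    # -1 is the sentinel some input formats report for non-splittable
    # (e.g. gzip-compressed) files whose split length is unknown.
    if file_path is None:
        raise ValueError("filePath cannot be null")
    if start_offset < 0:
        raise ValueError("startOffset (%d) cannot be negative" % start_offset)
    if length < -1:  # relaxed: allow -1, still reject other negatives
        raise ValueError("length (%d) is invalid" % length)
    return (file_path, start_offset, length)

# A compressed, non-splittable file: one split at offset 0 with length -1
print(set_input_block("/output656/part-r-0.gz", 0, -1))
```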
[jira] [Resolved] (SPARK-29182) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29182. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25856 [https://github.com/apache/spark/pull/25856] > Cache preferred locations of checkpointed RDD > - > > Key: SPARK-29182 > URL: https://issues.apache.org/jira/browse/SPARK-29182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > One Spark job in our cluster fits many ALS models in parallel. The fitting > goes well, but next, when we union all factors, the union operation is very > slow. > Looking at the driver stack dump, it looks like the driver spends a lot of > time on computing preferred locations. As we checkpoint training data before > fitting ALS, the time is spent in > ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS > interface to query file status and block locations. As we have a big number of > partitions derived from the checkpointed RDD, the union spends a lot of > time querying the same information. > This proposes to add a Spark config to control the caching behavior of > ReliableCheckpointRDD.getPreferredLocations. If it is enabled, > getPreferredLocations will only compute preferred locations once and cache them > for later usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
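The caching described above is, in essence, memoization: perform the expensive DFS file-status/block-location lookup once per partition and reuse the result on every later call. An illustrative Python sketch (the names and host mapping are invented, not Spark's actual code):

```python
from functools import lru_cache

CALLS = {"count": 0}  # track how often the expensive lookup runs

@lru_cache(maxsize=None)
def preferred_locations(partition_id):
    # Stand-in for the expensive DFS file-status / block-location query.
    CALLS["count"] += 1
    return ("host-%d" % (partition_id % 3),)

# Repeated union-time lookups for the same partition hit the cache
# after the first call, so the DFS is queried only once.
for _ in range(1000):
    preferred_locations(7)
print(CALLS["count"])  # 1
```

The trade-off, which is why the Spark change gates this behind a config, is that cached locations can go stale if blocks move, so callers must accept slightly outdated locality hints in exchange for the speedup.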
[jira] [Resolved] (SPARK-28885) Follow ANSI store assignment rules in table insertion by default
[ https://issues.apache.org/jira/browse/SPARK-28885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28885. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26107 [https://github.com/apache/spark/pull/26107] > Follow ANSI store assignment rules in table insertion by default > > > Key: SPARK-28885 > URL: https://issues.apache.org/jira/browse/SPARK-28885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > When inserting a value into a column with a different data type, Spark > performs type coercion. Currently, we support 3 policies for the store > assignment rules: ANSI, legacy and strict, which can be set via the option > "spark.sql.storeAssignmentPolicy": > 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the > behavior is mostly the same as PostgreSQL. It disallows certain unreasonable > type conversions such as converting `string` to `int` and `double` to > `boolean`. It will throw a runtime exception if the value is > out-of-range (overflow). > 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, > which is very loose. E.g., converting either `string` to `int` or `double` to > `boolean` is allowed. It is the current behavior in Spark 2.x for > compatibility with Hive. When inserting an out-of-range value into an integral > field, the low-order bits of the value are inserted (the same as Java/Scala > numeric type casting). For example, if 257 is inserted into a field of Byte > type, the result is 1. > 3. Strict: Spark doesn't allow any possible precision loss or data truncation > in store assignment, e.g., neither `double` to `int` nor `decimal` to > `double` is allowed. The rules were originally for the Dataset encoder. As far > as I know, no mainstream DBMS uses this policy by default. 
> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses > "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 > in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
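The legacy overflow behavior quoted above (inserting 257 into a Byte field yields 1) follows Java/Scala numeric casting, i.e. keeping the low-order bits of the value. A small illustrative Python helper, not Spark code:

```python
def insert_as_byte_legacy(value: int) -> int:
    """Legacy policy: keep the low-order 8 bits and reinterpret as a signed byte."""
    b = value & 0xFF                  # truncate to 8 bits
    return b - 256 if b > 127 else b  # reinterpret as signed

print(insert_as_byte_legacy(257))  # 1, the example from the description
print(insert_as_byte_legacy(200))  # -56: silent wrap-around under legacy

# Under the ANSI policy an out-of-range value raises a runtime exception
# instead, and the strict policy rejects the lossy int -> byte insert outright.
```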
[jira] [Assigned] (SPARK-28885) Follow ANSI store assignment rules in table insertion by default
[ https://issues.apache.org/jira/browse/SPARK-28885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28885: - Assignee: Gengliang Wang > Follow ANSI store assignment rules in table insertion by default > > > Key: SPARK-28885 > URL: https://issues.apache.org/jira/browse/SPARK-28885 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > When inserting a value into a column with a different data type, Spark > performs type coercion. Currently, we support 3 policies for the store > assignment rules: ANSI, legacy and strict, which can be set via the option > "spark.sql.storeAssignmentPolicy": > 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the > behavior is mostly the same as PostgreSQL. It disallows certain unreasonable > type conversions such as converting `string` to `int` and `double` to > `boolean`. It will throw a runtime exception if the value is > out-of-range (overflow). > 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, > which is very loose. E.g., converting either `string` to `int` or `double` to > `boolean` is allowed. It is the current behavior in Spark 2.x for > compatibility with Hive. When inserting an out-of-range value into an integral > field, the low-order bits of the value are inserted (the same as Java/Scala > numeric type casting). For example, if 257 is inserted into a field of Byte > type, the result is 1. > 3. Strict: Spark doesn't allow any possible precision loss or data truncation > in store assignment, e.g., neither `double` to `int` nor `decimal` to > `double` is allowed. The rules were originally for the Dataset encoder. As far > as I know, no mainstream DBMS uses this policy by default. > Currently, the V1 data source uses "Legacy" policy by default, while V2 uses > "Strict". 
This proposal is to use "ANSI" policy by default for both V1 and V2 > in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28211) Shuffle Storage API: Driver Lifecycle
[ https://issues.apache.org/jira/browse/SPARK-28211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-28211: Assignee: Yifei Huang > Shuffle Storage API: Driver Lifecycle > - > > Key: SPARK-28211 > URL: https://issues.apache.org/jira/browse/SPARK-28211 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Yifei Huang >Priority: Major > > As part of the shuffle storage API, allow users to hook in application-wide > startup and shutdown methods. This can do things like create tables in the > shuffle storage database, or register / unregister against file servers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28211) Shuffle Storage API: Driver Lifecycle
[ https://issues.apache.org/jira/browse/SPARK-28211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-28211. -- Fix Version/s: 3.0.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/25823 > Shuffle Storage API: Driver Lifecycle > - > > Key: SPARK-28211 > URL: https://issues.apache.org/jira/browse/SPARK-28211 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Yifei Huang >Priority: Major > Fix For: 3.0.0 > > > As part of the shuffle storage API, allow users to hook in application-wide > startup and shutdown methods. This can do things like create tables in the > shuffle storage database, or register / unregister against file servers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29481: Description: The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // report table not found as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution was: The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // fails as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution > all the commands should look up catalog/table like v2 commands > -- > > Key: SPARK-29481 > URL: https://issues.apache.org/jira/browse/SPARK-29481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The newly added v2 commands support multi-catalog and respect the current > catalog/namespace. However, it's not true for old v1 commands. > This leads to very confusing behaviors, for example > {code} > USE my_catalog > DESC t // success and describe the table t from my_catalog > ANALYZE TABLE t // report table not found as there is no table t in the > session catalog > {code} > We should make sure all the commands have the same behavior regarding table > resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29481: Summary: all the commands should look up catalog/table like v2 commands (was: all the commands should respect the current catalog/namespace) > all the commands should look up catalog/table like v2 commands > -- > > Key: SPARK-29481 > URL: https://issues.apache.org/jira/browse/SPARK-29481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The newly added v2 commands support multi-catalog and respect the current > catalog/namespace. However, it's not true for old v1 commands. > This leads to very confusing behaviors, for example > {code} > USE my_catalog > DESC t // success and describe the table t from my_catalog > ANALYZE TABLE t // fails as there is no table t in the session catalog > {code} > We should make sure all the commands have the same behavior regarding table > resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29482) ANALYZE TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29482: Summary: ANALYZE TABLE should look up catalog/table like v2 commands (was: ANALYZE TABLE should respect current catalog/namespace) > ANALYZE TABLE should look up catalog/table like v2 commands > > > Key: SPARK-29482 > URL: https://issues.apache.org/jira/browse/SPARK-29482 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29482) ANALYZE TABLE should respect current catalog/namespace
Wenchen Fan created SPARK-29482: --- Summary: ANALYZE TABLE should respect current catalog/namespace Key: SPARK-29482 URL: https://issues.apache.org/jira/browse/SPARK-29482 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29481) all the commands should respect the current catalog/namespace
Wenchen Fan created SPARK-29481: --- Summary: all the commands should respect the current catalog/namespace Key: SPARK-29481 URL: https://issues.apache.org/jira/browse/SPARK-29481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan The newly added v2 commands support multi-catalog and respect the current catalog/namespace. However, it's not true for old v1 commands. This leads to very confusing behaviors, for example {code} USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // fails as there is no table t in the session catalog {code} We should make sure all the commands have the same behavior regarding table resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29414) HasOutputCol param isSet() property is not preserved after persistence
[ https://issues.apache.org/jira/browse/SPARK-29414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952075#comment-16952075 ] Huaxin Gao commented on SPARK-29414: I somehow can't reproduce the problem. In my test, isSet returns false for loaded_model. Could you please try 2.4? > HasOutputCol param isSet() property is not preserved after persistence > -- > > Key: SPARK-29414 > URL: https://issues.apache.org/jira/browse/SPARK-29414 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > HasOutputCol param isSet() property is not preserved after saving and loading > using DefaultParamsReadable and DefaultParamsWritable.
> {code:python}
> import pytest
> from pyspark import keyword_only
> from pyspark.ml import Model
> from pyspark.sql import DataFrame
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param.shared import HasInputCol, HasOutputCol
> from pyspark.sql.functions import *
>
>
> class HasOutputColTester(Model,
>                          HasInputCol,
>                          HasOutputCol,
>                          DefaultParamsReadable,
>                          DefaultParamsWritable):
>     @keyword_only
>     def __init__(self, inputCol: str = None, outputCol: str = None):
>         super(HasOutputColTester, self).__init__()
>         kwargs = self._input_kwargs
>         self.setParams(**kwargs)
>
>     @keyword_only
>     def setParams(self, inputCol: str = None, outputCol: str = None):
>         kwargs = self._input_kwargs
>         self._set(**kwargs)
>         return self
>
>     def _transform(self, data: DataFrame) -> DataFrame:
>         return data
>
>
> class TestHasInputColParam(object):
>     def test_persist_input_col_set(self, spark, temp_dir):
>         path = temp_dir + '/test_model'
>         model = HasOutputColTester()
>         assert not model.isDefined(model.inputCol)
>         assert not model.isSet(model.inputCol)
>         assert model.isDefined(model.outputCol)
>         assert not model.isSet(model.outputCol)
>         model.write().overwrite().save(path)
>         loaded_model: HasOutputColTester = HasOutputColTester.load(path)
>         assert not loaded_model.isDefined(model.inputCol)
>         assert not loaded_model.isSet(model.inputCol)
>         assert loaded_model.isDefined(model.outputCol)
>         assert not loaded_model.isSet(model.outputCol)  # AssertionError: assert not True
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29480) Ensure that each input row refers to a single relationship type
Martin Junghanns created SPARK-29480: Summary: Ensure that each input row refers to a single relationship type Key: SPARK-29480 URL: https://issues.apache.org/jira/browse/SPARK-29480 Project: Spark Issue Type: New Feature Components: Graph Affects Versions: 3.0.0 Reporter: Martin Junghanns As pointed out in https://github.com/apache/spark/pull/24851/files#r334704485 we need to make sure that no row in an input Dataset representing different relationship types has multiple rel type columns set to {{true}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29479) Design more fluent API for Cypher querying
Martin Junghanns created SPARK-29479: Summary: Design more fluent API for Cypher querying Key: SPARK-29479 URL: https://issues.apache.org/jira/browse/SPARK-29479 Project: Spark Issue Type: New Feature Components: Graph Affects Versions: 3.0.0 Reporter: Martin Junghanns As pointed out in https://github.com/apache/spark/pull/24851/files#r334614538 we would like to add a more fluent API for executing and parameterizing Cypher queries. This issue is supposed to track that progress. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28560. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25295 [https://github.com/apache/spark/pull/25295] > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28560: --- Assignee: Ke Jia > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
[ https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951909#comment-16951909 ] Sean R. Owen commented on SPARK-27981: -- I'm not sure there is a workaround possible for this one anytime soon. It's just a warning. I don't think we'd be able to resolve it. > Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` > -- > > Key: SPARK-27981 > URL: https://issues.apache.org/jira/browse/SPARK-27981 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This PR aims to remove the following warnings for `java.nio.Bits.unaligned` > at JDK9/10/11/12. Please note that there are more warnings which is beyond of > this PR's scope. > {code} > bin/spark-shell --driver-java-options=--illegal-access=warn > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) > to method java.nio.Bits.unaligned() > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29473) move statement logical plans to a new file
[ https://issues.apache.org/jira/browse/SPARK-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29473. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26125 [https://github.com/apache/spark/pull/26125] > move statement logical plans to a new file > -- > > Key: SPARK-29473 > URL: https://issues.apache.org/jira/browse/SPARK-29473 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29478) Improve tooltip information for AggregatedLogs Tab
[ https://issues.apache.org/jira/browse/SPARK-29478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951862#comment-16951862 ] Ankit Raj Boudh commented on SPARK-29478: - I will check this issue > Improve tooltip information for AggregatedLogs Tab > -- > > Key: SPARK-29478 > URL: https://issues.apache.org/jira/browse/SPARK-29478 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29478) Improve tooltip information for AggregatedLogs Tab
ABHISHEK KUMAR GUPTA created SPARK-29478: Summary: Improve tooltip information for AggregatedLogs Tab Key: SPARK-29478 URL: https://issues.apache.org/jira/browse/SPARK-29478 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Statistics under Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951858#comment-16951858 ] Ankit Raj Boudh commented on SPARK-29477: - i will check this issue > Improve tooltip information for Streaming Statistics under Streaming Tab > > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29477) Improve tooltip information for Streaming Statistics under Streaming Tab
ABHISHEK KUMAR GUPTA created SPARK-29477: Summary: Improve tooltip information for Streaming Statistics under Streaming Tab Key: SPARK-29477 URL: https://issues.apache.org/jira/browse/SPARK-29477 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mob-ai updated SPARK-29224: --- Attachment: url_loss.xlsx > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Priority: Major > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mob-ai updated SPARK-29224: --- Attachment: (was: url_loss.xlsx) > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Priority: Major > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
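The factorization machine proposed in SPARK-29224 scores y(x) = w0 + sum_i w_i*x_i + sum_{i<j} <v_i, v_j>*x_i*x_j, and the cited Rendle paper shows the pairwise term can be computed in O(k*n) rather than O(n^2). A pure-Python sketch of the prediction step (variable names and numbers are illustrative, not the eventual ml-pipeline API):

```python
# FM prediction using Rendle's O(k*n) rewrite of the pairwise term:
#   sum_{i<j} <v_i, v_j> x_i x_j
#     = 0.5 * sum_f [ (sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2 ]
def fm_predict(w0, w, V, x):
    """w0: bias; w: linear weights (n); V: factor matrix (n x k); x: features (n)."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += s * s - s_sq
    return w0 + linear + 0.5 * pairwise

# Sanity check against the naive O(n^2) pairwise sum on toy values.
w0, w = 0.1, [0.2, -0.3, 0.5]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
x = [1.0, 2.0, 0.5]
naive = w0 + sum(w[i] * x[i] for i in range(3)) + sum(
    sum(V[i][f] * V[j][f] for f in range(2)) * x[i] * x[j]
    for i in range(3) for j in range(i + 1, 3))
assert abs(fm_predict(w0, w, V, x) - naive) < 1e-12
```

Training (logloss/MSE with mini-batch SGD, as the requirements list) would fit w0, w, and V; only the forward pass is sketched here.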
[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
[ https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951839#comment-16951839 ] pavithra ramachandran commented on SPARK-29476: --- i shall work on this > Add tooltip information for Thread Dump links and Thread details table > columns in Executors Tab > --- > > Key: SPARK-29476 > URL: https://issues.apache.org/jira/browse/SPARK-29476 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
jobit mathew created SPARK-29476: Summary: Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab Key: SPARK-29476 URL: https://issues.apache.org/jira/browse/SPARK-29476 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: jobit mathew -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29475) Executors aren't marked as lost on OutOfMemoryError in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikita Gorbachevski updated SPARK-29475: Summary: Executors aren't marked as lost on OutOfMemoryError in YARN mode (was: Executors aren't marked as dead on OutOfMemoryError in YARN mode) > Executors aren't marked as lost on OutOfMemoryError in YARN mode > > > Key: SPARK-29475 > URL: https://issues.apache.org/jira/browse/SPARK-29475 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Nikita Gorbachevski >Priority: Major > > My spark-streaming application runs in yarn cluster mode with > ``spark.streaming.concurrentJobs`` set to 50. Once i observed that lots of > batches were scheduled and application did not make any progress. > Thread dump showed that all the streaming threads are blocked, infinitely > waiting for result from executor on > ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. > {code:java} > "streaming-job-executor-11" - Thread t@324 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <7fd76f11> (a > scala.concurrent.impl.Promise$CompletionLatch) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) > at > org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) > at > 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} > My tasks are short running and pretty simple, e.g. read raw data from Kafka, > normalize it and write back to Kafka. > After logs analysis i noticed that there were lots of ``RpcTimeoutException`` > on both executor > {code:java} > driver-heartbeater WARN executor.Executor: Issue communicating with driver in > heartbeater > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > and driver sides during context clearing. > {code:java} > WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive > any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout > WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive > any reply in 120 seconds. This timeout is controlled by > spark.rpc.askTimeoutorg.apache.spark.rpc.RpcTimeoutException: Cannot receive > any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is > controlled by spark.rpc.askTimeout at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Also the only error on executors was > {code:java} > SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL > TERM > {code} > no exceptions at all. > After digging into YARN logs i noticed that executors were killed by AM > because of high memory usage. > Also there were no any logs on driver side saying that executors were lost. 
> Thus it seems that the driver wasn't notified about this. > Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend > which logs such a message. The ``exitExecutor`` method looks similar but its > message should look different. > However I believe that the driver is notified that an executor is lost via an async rpc > call, but if the executor encounters issues with rpc because of high GC pressure > this message won't be sent. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h.
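The reporter's hypothesis — the "executor lost" notification is itself an async RPC message, so when the RPC layer is degraded by GC pressure the message is dropped and the driver's view is never updated — can be illustrated with a small stand-alone sketch. Everything here (`Driver`, `RpcChannel`, the method names) is hypothetical plain Python for illustration, not Spark's actual classes:

```python
# Hypothetical sketch (not Spark internals): if the "executor lost"
# notification travels over the same unreliable RPC channel, the driver's
# registry of live executors never gets updated.

class Driver:
    def __init__(self):
        self.live_executors = {"exec-1", "exec-2"}

    def on_executor_lost(self, executor_id):
        self.live_executors.discard(executor_id)


class RpcChannel:
    """Stand-in for the async RPC link; drops messages when unhealthy."""

    def __init__(self, driver, healthy=True):
        self.driver = driver
        self.healthy = healthy

    def send_executor_lost(self, executor_id):
        if not self.healthy:
            return False  # message silently dropped, as hypothesized in the report
        self.driver.on_executor_lost(executor_id)
        return True


driver = Driver()

# Healthy channel: the driver learns about the loss.
RpcChannel(driver, healthy=True).send_executor_lost("exec-1")

# Channel degraded by GC pressure: the notification is dropped and the
# driver still believes exec-2 is alive, matching the observed hang.
RpcChannel(driver, healthy=False).send_executor_lost("exec-2")

print(sorted(driver.live_executors))  # exec-2 is still listed as live
```

This matches the symptom in the logs: the executor receives SIGTERM from the AM, but the driver side shows no "executor lost" message and keeps waiting on `awaitReady` forever.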
[jira] [Updated] (SPARK-29475) Executors aren't marked as dead on OutOfMemoryError in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikita Gorbachevski updated SPARK-29475: Description: My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. Once I observed that lots of batches were scheduled and the application did not make any progress. A thread dump showed that all the streaming threads are blocked, infinitely waiting for a result from an executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} My tasks are short-running and pretty simple, e.g. read raw data from Kafka, normalize it and write back to Kafka. 
After analyzing the logs, I noticed that there were lots of ``RpcTimeoutException`` on both the executor {code:java} driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} and driver sides during context clearing. {code:java} WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} Also the only error on the executors was {code:java} SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM {code} no exceptions at all. After digging into the YARN logs I noticed that the executors were killed by the AM because of high memory usage. Also there were no logs on the driver side saying that executors were lost. Thus it seems that the driver wasn't notified about this. Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar but its message should look different. However I believe that the driver is notified that an executor is lost via an async rpc call, but if the executor encounters issues with rpc because of high GC pressure this message won't be sent. was: My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. 
Once i observed that lots of batches were scheduled and application did not make any progress. Thread dump showed that all the streaming threads are blocked, infinitely waiting for result from executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
[jira] [Created] (SPARK-29475) Executors aren't marked as dead on OutOfMemoryError in YARN mode
Nikita Gorbachevski created SPARK-29475: --- Summary: Executors aren't marked as dead on OutOfMemoryError in YARN mode Key: SPARK-29475 URL: https://issues.apache.org/jira/browse/SPARK-29475 Project: Spark Issue Type: Bug Components: DStreams, Spark Core, YARN Affects Versions: 2.3.0 Reporter: Nikita Gorbachevski My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. Once I observed that lots of batches were scheduled and the application did not make any progress. A thread dump showed that all the streaming threads are blocked, infinitely waiting for a result from an executor on ``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. {code:java} "streaming-job-executor-11" - Thread t@324 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code} My tasks are short-running 
and pretty simple, e.g. read raw data from Kafka, normalize it and write back to Kafka. After analyzing the logs, I noticed that there were lots of ``RpcTimeoutException`` on both the executor {code:java} driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} and driver sides during context clearing. {code:java} WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) {code} Also the only error on the executors was {code:java} SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM {code} no exceptions at all. After digging into the YARN logs I noticed that the executors were killed by the AM because of high memory usage. Also there were no logs on the driver side saying that executors were lost. Thus it seems that the driver wasn't notified about this. Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar but its message should look different. However I believe that the driver is notified that an executor is lost via an async rpc call, but if the executor encounters issues with rpc because of high GC pressure this message won't be sent. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29474) CLI support for Spark-on-Docker-on-Yarn
Adam Antal created SPARK-29474: -- Summary: CLI support for Spark-on-Docker-on-Yarn Key: SPARK-29474 URL: https://issues.apache.org/jira/browse/SPARK-29474 Project: Spark Issue Type: Improvement Components: Spark Shell, YARN Affects Versions: 2.4.4 Reporter: Adam Antal The Docker-on-Yarn feature has been stable in Hadoop for a while now. One can run Spark on Docker using the Docker-on-Yarn feature by providing runtime environment variables to the Spark AM and executor containers, similar to this: {noformat} --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" {noformat} This is not very user-friendly. I suggest adding CLI options to specify: - whether a docker image should be used ({{--docker}}) - which docker image should be used ({{--docker-image}}) - which docker mounts should be used ({{--docker-mounts}}) for the AM and executor containers separately. Let's discuss! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
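The proposed flags would essentially be sugar for the property list quoted above. A plain-Python sketch of that translation — the flag semantics and the helper `docker_flags_to_conf` are assumptions based on this proposal, not anything in Spark's actual CLI:

```python
# Sketch (hypothetical helper, not part of spark-submit): expand the proposed
# --docker-image / --docker-mounts flags into the six YARN_CONTAINER_RUNTIME_*
# properties shown in the issue, for both the AM and the executors.

def docker_flags_to_conf(image, mounts=None):
    conf = {}
    for prefix in ("spark.yarn.appMasterEnv.", "spark.executorEnv."):
        conf[prefix + "YARN_CONTAINER_RUNTIME_TYPE"] = "docker"
        conf[prefix + "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE"] = image
        if mounts:
            conf[prefix + "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS"] = ",".join(mounts)
    return conf


pairs = docker_flags_to_conf("repo/image:tag",
                             ["/etc/passwd:/etc/passwd:ro",
                              "/etc/hadoop:/etc/hadoop:ro"])
# Render back into the verbose --conf form a user writes today.
for key, value in sorted(pairs.items()):
    print(f"--conf {key}={value}")
```

Since the proposal asks for the AM and executor containers to be configurable separately, the real flags would presumably take per-role variants; the sketch only shows the common case where both sides use the same image and mounts.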
[jira] [Comment Edited] (SPARK-29428) Can't persist/set None-valued param
[ https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951747#comment-16951747 ] Borys Biletskyy edited comment on SPARK-29428 at 10/15/19 9:12 AM: --- [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used the isSet(...) method to understand if the parameter was set, until I found out that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. Having None params doesn't seem to be a good idea to me either, especially when there are isSet(...) and isDefined(...) methods. was (Author: borys.biletskyy): [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used isSet(...) method to understand if the parameter was set, until I found out that that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. 
> Can't persist/set None-valued param > > > Key: SPARK-29428 > URL: https://issues.apache.org/jira/browse/SPARK-29428 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > {code:java} > import pytest > from pyspark import keyword_only > from pyspark.ml import Model > from pyspark.sql import DataFrame > from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable > from pyspark.ml.param.shared import HasInputCol > from pyspark.sql.functions import * > class NoneParamTester(Model, > HasInputCol, > DefaultParamsReadable, > DefaultParamsWritable > ): > @keyword_only > def __init__(self, inputCol: str = None): > super(NoneParamTester, self).__init__() > kwargs = self._input_kwargs > self.setParams(**kwargs) > @keyword_only > def setParams(self, inputCol: str = None): > kwargs = self._input_kwargs > self._set(**kwargs) > return self > def _transform(self, data: DataFrame) -> DataFrame: > return data > class TestNoneParam(object): > def test_persist_none(self, spark, temp_dir): > path = temp_dir + '/test_model' > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.write().overwrite().save(path) > NoneParamTester.load(path) # TypeError: Could not convert 'NoneType'> to string type > def test_set_none(self, spark): > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.set(model.inputCol, None) # TypeError: Could not convert > to string type > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29428) Can't persist/set None-valued param
[ https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951747#comment-16951747 ] Borys Biletskyy commented on SPARK-29428: - [~bryanc], I tried to implement some logic depending on the default/non-default value of parameters. I used the isSet(...) method to understand if the parameter was set, until I found out that after persistence isSet(...) was not preserved (https://issues.apache.org/jira/browse/SPARK-29414). So I decided to use None values as a workaround to distinguish default from non-default param values. It seemed to work, until I tried to persist it. > Can't persist/set None-valued param > > > Key: SPARK-29428 > URL: https://issues.apache.org/jira/browse/SPARK-29428 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > {code:java} > import pytest > from pyspark import keyword_only > from pyspark.ml import Model > from pyspark.sql import DataFrame > from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable > from pyspark.ml.param.shared import HasInputCol > from pyspark.sql.functions import * > class NoneParamTester(Model, > HasInputCol, > DefaultParamsReadable, > DefaultParamsWritable > ): > @keyword_only > def __init__(self, inputCol: str = None): > super(NoneParamTester, self).__init__() > kwargs = self._input_kwargs > self.setParams(**kwargs) > @keyword_only > def setParams(self, inputCol: str = None): > kwargs = self._input_kwargs > self._set(**kwargs) > return self > def _transform(self, data: DataFrame) -> DataFrame: > return data > class TestNoneParam(object): > def test_persist_none(self, spark, temp_dir): > path = temp_dir + '/test_model' > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.write().overwrite().save(path) > NoneParamTester.load(path) # 
TypeError: Could not convert 'NoneType'> to string type > def test_set_none(self, spark): > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(model.inputCol) > assert model.getInputCol() is None > model.set(model.inputCol, None) # TypeError: Could not convert > to string type > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
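Until the persistence behaviour is fixed, one way to keep isSet(...) meaningful is to never record None as a param value in the first place: skip the set call entirely, so the param simply stays unset. A minimal plain-Python sketch of that guard — the `Params` class and `set_if_not_none` helper here are hypothetical stand-ins, not the pyspark.ml API:

```python
# Workaround sketch (no pyspark dependency): rather than storing None to mean
# "explicitly unset", skip the set call so is_set() stays False and no None
# value ever reaches a string type converter.

class Params:
    def __init__(self):
        self._param_map = {}

    def set_if_not_none(self, name, value):
        # Only record params with real values; None is silently ignored.
        if value is not None:
            self._param_map[name] = value

    def is_set(self, name):
        return name in self._param_map


model = Params()
model.set_if_not_none("inputCol", None)    # ignored, is_set stays False
model.set_if_not_none("outputCol", "out")  # recorded normally

print(model.is_set("inputCol"), model.is_set("outputCol"))
```

With this convention, "param was never given" and "param was given as None" collapse into the same unset state, which is exactly the distinction the reporter was trying to preserve via None values, so it trades one limitation for another.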
[jira] [Comment Edited] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951672#comment-16951672 ] Enrico Minack edited comment on SPARK-18748 at 10/15/19 7:16 AM: - I think the behaviour of {{yourUdf.asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. was (Author: enricomi): I think the behaviour of {{asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. 
> UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println, > which causes very poor performance for our real use case. > This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we lose some of the optimizations like predicate > pushdown, etc., and it's very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951672#comment-16951672 ] Enrico Minack commented on SPARK-18748: --- I think the behaviour of {{asNondeterministic()}} is exactly what you want in this situation. It is not just a work-around, it is the right way to make Spark call the udf exactly once per row. For [~hqb1989], this is the perfect solution for you as your udf actually is non-deterministic. The Analyzer avoids calling the method multiple times for the same row because it thinks it does not produce the same result for the same input. For +deterministic+ but expensive udfs, this produces the desired behaviour, but it is counterintuitive to call them +non-deterministic.+ Maybe there should also be an {{asExpensive()}} method to flag a udf as expensive, so the analyzer / optimizer does exactly what it currently does for non-deterministic udfs. > UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println. > that causes very poor performance for our real use case. 
> This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we lose some of the optimizations like predicate > pushdown, etc., and it's very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
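The cost of the duplicated evaluation, and what collapsing it to a single call per row buys, can be shown without Spark at all. In this plain-Python sketch a call counter stands in for the expensive UDF, and `functools.lru_cache` stands in for the optimizer treating the expression as evaluate-once (which is the practical effect of marking the udf non-deterministic, as discussed above):

```python
# Illustration only: three references to the same expression either trigger
# three evaluations (the reported behaviour) or one (the desired behaviour).

import functools

calls = {"n": 0}

def very_expensive_calc(s):
    calls["n"] += 1  # stands in for the costly work behind "blahblah1"
    return "nothing"

# Naive plan: column c is referenced three times, so the UDF runs three times.
naive = [very_expensive_calc("a") for _ in range(3)]
naive_calls = calls["n"]

calls["n"] = 0
cached_calc = functools.lru_cache(maxsize=None)(very_expensive_calc)

# Collapsed plan: three references, but only one real evaluation.
collapsed = [cached_calc("a") for _ in range(3)]

print(naive_calls, calls["n"])  # 3 evaluations vs 1
```

The caching here is per-argument, like the cache() workaround in the issue; Spark's own collapsing of a non-deterministic udf is per plan node rather than a memo table, but the effect on evaluation count is the same.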