[jira] [Commented] (SPARK-39750) Enable spark.sql.cbo.enabled by default

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565291#comment-17565291
 ] 

Apache Spark commented on SPARK-39750:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37163

> Enable spark.sql.cbo.enabled by default
> ---
>
> Key: SPARK-39750
> URL: https://issues.apache.org/jira/browse/SPARK-39750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-39750) Enable spark.sql.cbo.enabled by default

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39750:


Assignee: (was: Apache Spark)

> Enable spark.sql.cbo.enabled by default
> ---
>
> Key: SPARK-39750
> URL: https://issues.apache.org/jira/browse/SPARK-39750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-39750) Enable spark.sql.cbo.enabled by default

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565289#comment-17565289
 ] 

Apache Spark commented on SPARK-39750:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37163

> Enable spark.sql.cbo.enabled by default
> ---
>
> Key: SPARK-39750
> URL: https://issues.apache.org/jira/browse/SPARK-39750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-39750) Enable spark.sql.cbo.enabled by default

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39750:


Assignee: Apache Spark

> Enable spark.sql.cbo.enabled by default
> ---
>
> Key: SPARK-39750
> URL: https://issues.apache.org/jira/browse/SPARK-39750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-39750) Enable spark.sql.cbo.enabled by default

2022-07-11 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-39750:
---

 Summary: Enable spark.sql.cbo.enabled by default
 Key: SPARK-39750
 URL: https://issues.apache.org/jira/browse/SPARK-39750
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang
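
For context, spark.sql.cbo.enabled currently defaults to false, so cost-based optimization is opt-in. A minimal sketch of how it is enabled today (the table name is illustrative); this issue proposes making it the default:

{code:java}
// Enable the cost-based optimizer for the current session.
spark.conf.set("spark.sql.cbo.enabled", "true")

// CBO only helps once table/column statistics have been collected.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")
{code}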









[jira] [Resolved] (SPARK-39647) Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been restarted

2022-07-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-39647.
-
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37052
[https://github.com/apache/spark/pull/37052]

> Block push fails with java.lang.IllegalArgumentException: Active local dirs 
> list has not been updated by any executor registration even when the 
> NodeManager hasn't been restarted
> --
>
> Key: SPARK-39647
> URL: https://issues.apache.org/jira/browse/SPARK-39647
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> We saw these exceptions during block push:
> {code:java}
> 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block 
> shuffle_170_568_174, and will not retry (0 retries)
> org.apache.spark.network.shuffle.BlockPushException: 
> !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException:
>  Active local dirs list has not been updated by any executor registration
>   at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
> 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 
> to BlockManagerId(, node-x, 7337, None) failed.
> {code}
> Note: The NodeManager on node-x (node against which this exception was seen) 
> was not restarted.
> This happened because the executor registers the block manager with 
> {{BlockManagerMaster}} before it registers with the ESS. In push-based 
> shuffle, a block manager is selected by the driver as a merger for the 
> shuffle push. However, the ESS on that node can successfully merge the block 
> only if it has received the metadata about merged directories from the local 
> executor (sent when the local executor registers with the ESS). If this local 
> executor registration is delayed but the ESS host has already been picked as 
> a merger, it will fail to merge the blocks pushed to it, which is what 
> happened here.
> The local executor on node-x is executor 754 and the block manager 
> registration happened at 13:28:11
> {code:java}
> 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has 
> registered (new total is 1200)
> 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager 
> node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
> {code}
> The application got registered with the shuffle server on node-x at 13:29:40
> {code:java}
> 2022-06-24 13:29:40,343 INFO 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active 
> local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, 
> /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, 
> /grid/c/tmp/yarn/] for application application_1653753500486_3193550
>  {code}
> node-x was selected as a merger by the driver after 13:28:11, and when the 
> executors started pushing to it, all those pushes failed until 13:29:40.
> We can fix this by having the executor register with the ESS before it 
> registers the block manager with the {{BlockManagerMaster}}.
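
The proposed fix is purely an ordering change at executor startup. A toy sketch of that ordering, with hypothetical method names (this is not the actual Spark executor code):

{code:java}
object ExecutorStartupOrderSketch {
  // Hypothetical stand-ins for the two registrations discussed above.
  def registerWithExternalShuffleService(): Unit =
    println("ESS now knows the merged-directory metadata for this node")
  def registerBlockManagerWithMaster(): Unit =
    println("BlockManagerMaster now sees this node, so it can be picked as a merger")

  def main(args: Array[String]): Unit = {
    // Register with the ESS first, then announce the block manager, so the node
    // can never be chosen as a merger before its local dirs are known to the ESS.
    registerWithExternalShuffleService()
    registerBlockManagerWithMaster()
  }
}
{code}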






[jira] [Assigned] (SPARK-39647) Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been restarted

2022-07-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-39647:
---

Assignee: Chandni Singh

> Block push fails with java.lang.IllegalArgumentException: Active local dirs 
> list has not been updated by any executor registration even when the 
> NodeManager hasn't been restarted
> --
>
> Key: SPARK-39647
> URL: https://issues.apache.org/jira/browse/SPARK-39647
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> We saw these exceptions during block push:
> {code:java}
> 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block 
> shuffle_170_568_174, and will not retry (0 retries)
> org.apache.spark.network.shuffle.BlockPushException: 
> !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException:
>  Active local dirs list has not been updated by any executor registration
>   at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
>   at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
> 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 
> to BlockManagerId(, node-x, 7337, None) failed.
> {code}
> Note: The NodeManager on node-x (node against which this exception was seen) 
> was not restarted.
> This happened because the executor registers the block manager with 
> {{BlockManagerMaster}} before it registers with the ESS. In push-based 
> shuffle, a block manager is selected by the driver as a merger for the 
> shuffle push. However, the ESS on that node can successfully merge the block 
> only if it has received the metadata about merged directories from the local 
> executor (sent when the local executor registers with the ESS). If this local 
> executor registration is delayed but the ESS host has already been picked as 
> a merger, it will fail to merge the blocks pushed to it, which is what 
> happened here.
> The local executor on node-x is executor 754 and the block manager 
> registration happened at 13:28:11
> {code:java}
> 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has 
> registered (new total is 1200)
> 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager 
> node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
> {code}
> The application got registered with the shuffle server on node-x at 13:29:40
> {code:java}
> 2022-06-24 13:29:40,343 INFO 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active 
> local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, 
> /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, 
> /grid/c/tmp/yarn/] for application application_1653753500486_3193550
>  {code}
> node-x was selected as a merger by the driver after 13:28:11, and when the 
> executors started pushing to it, all those pushes failed until 13:29:40.
> We can fix this by having the executor register with the ESS before it 
> registers the block manager with the {{BlockManagerMaster}}.






[jira] [Commented] (SPARK-38910) Cleaning sparkStaging dir should happen before unregister()

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565243#comment-17565243
 ] 

Apache Spark commented on SPARK-38910:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/37162

> Cleaning sparkStaging dir should happen before unregister()
> -
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
>   ShutdownHookManager.addShutdownHook(priority) { () =>
> try {
>   val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
>   val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts
>   if (!finished) {
> // The default state of ApplicationMaster is failed if it is 
> invoked by shut down hook.
> // This behavior is different compared to 1.x version.
> // If user application is exited ahead of time by calling 
> System.exit(N), here mark
> // this application as failed with EXIT_EARLY. For a good 
> shutdown, user shouldn't call
> // System.exit(0) to terminate the application.
> finish(finalStatus,
>   ApplicationMaster.EXIT_EARLY,
>   "Shutdown hook called before final status was reported.")
>   }
>   if (!unregistered) {
> // we only want to unregister if we don't want the RM to retry
> if (finalStatus == FinalApplicationStatus.SUCCEEDED || 
> isLastAttempt) {
>   unregister(finalStatus, finalMsg)
>   cleanupStagingDir(new 
> Path(System.getenv("SPARK_YARN_STAGING_DIR")))
> }
>   }
> } catch {
>   case e: Throwable =>
> logWarning("Ignoring Exception while stopping ApplicationMaster 
> from shutdown hook", e)
> }
>   }{code}
> unregister() may throw an exception, so the staging dir should be cleaned before unregister() is called.
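
For illustration, a minimal sketch of the proposed ordering, adapted from the shutdown hook quoted above (the actual patch in the linked PR may differ):

{code:java}
// Clean the staging dir first, so an exception thrown by unregister() can no
// longer prevent the cleanup from running.
if (!unregistered) {
  // we only want to unregister if we don't want the RM to retry
  if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
    cleanupStagingDir(new Path(System.getenv("SPARK_YARN_STAGING_DIR")))
    unregister(finalStatus, finalMsg)
  }
}
{code}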






[jira] [Commented] (SPARK-38910) Cleaning sparkStaging dir should happen before unregister()

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565242#comment-17565242
 ] 

Apache Spark commented on SPARK-38910:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/37162

> Cleaning sparkStaging dir should happen before unregister()
> -
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
>   ShutdownHookManager.addShutdownHook(priority) { () =>
> try {
>   val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
>   val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts
>   if (!finished) {
> // The default state of ApplicationMaster is failed if it is 
> invoked by shut down hook.
> // This behavior is different compared to 1.x version.
> // If user application is exited ahead of time by calling 
> System.exit(N), here mark
> // this application as failed with EXIT_EARLY. For a good 
> shutdown, user shouldn't call
> // System.exit(0) to terminate the application.
> finish(finalStatus,
>   ApplicationMaster.EXIT_EARLY,
>   "Shutdown hook called before final status was reported.")
>   }
>   if (!unregistered) {
> // we only want to unregister if we don't want the RM to retry
> if (finalStatus == FinalApplicationStatus.SUCCEEDED || 
> isLastAttempt) {
>   unregister(finalStatus, finalMsg)
>   cleanupStagingDir(new 
> Path(System.getenv("SPARK_YARN_STAGING_DIR")))
> }
>   }
> } catch {
>   case e: Throwable =>
> logWarning("Ignoring Exception while stopping ApplicationMaster 
> from shutdown hook", e)
> }
>   }{code}
> unregister() may throw an exception, so the staging dir should be cleaned before unregister() is called.






[jira] [Resolved] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-39723.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37135
[https://github.com/apache/spark/pull/37135]

> Implement functionExists/getFunc in SparkR for 3L namespace
> ---
>
> Key: SPARK-39723
> URL: https://issues.apache.org/jira/browse/SPARK-39723
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-39723:
-

Assignee: Ruifeng Zheng

> Implement functionExists/getFunc in SparkR for 3L namespace
> ---
>
> Key: SPARK-39723
> URL: https://issues.apache.org/jira/browse/SPARK-39723
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Updated] (SPARK-39749) Always use plain string representation on casting Decimal to String

2022-07-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39749:
---
Summary: Always use plain string representation on casting Decimal to 
String  (was: Use plain string representation on casting Decimal to String)

> Always use plain string representation on casting Decimal to String
> ---
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, casting a decimal to string type results in strings with 
> exponential notation if the adjusted exponent is less than -6. This is 
> consistent with BigDecimal.toString 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This is different from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.
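
For context, the difference is visible with the JDK alone; a small self-contained sketch of the two representations (this only illustrates BigDecimal.toString vs. toPlainString, not Spark's cast code):

{code:java}
import java.math.BigDecimal

object DecimalStringDemo {
  def main(args: Array[String]): Unit = {
    val d = new BigDecimal("0.0000001")  // adjusted exponent is -7, i.e. less than -6
    println(d.toString)                  // "1E-7"      -> exponential notation
    println(d.toPlainString)             // "0.0000001" -> the plain representation proposed here
  }
}
{code}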






[jira] [Assigned] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39748:


Assignee: (was: Apache Spark)

> Include the origin logical plan for LogicalRDD if it comes from DataFrame
> -
>
> Key: SPARK-39748
> URL: https://issues.apache.org/jira/browse/SPARK-39748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> When Spark converts the DataFrame to LogicalRDD for some reason (e.g. 
> foreachBatch sink), Spark just picks the RDD from the origin DataFrame and 
> discards the (logical/physical) plan.
> The origin logical plan can be useful for several use cases, including:
> 1. connecting the overall logical plan into one
> 2. inheriting plan stats from the origin logical plan
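
As a concrete, hedged sketch of the foreachBatch case mentioned above, using the built-in rate source; the batch DataFrame handed to the function is backed by an RDD-based plan, which is where the origin plan currently gets lost:

{code:java}
import org.apache.spark.sql.DataFrame

val query = spark.readStream
  .format("rate")          // built-in test source that emits (timestamp, value) rows
  .load()
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF's logical plan is a LogicalRDD here; the micro-batch's origin
    // plan (and its stats) are discarded, which this issue proposes to keep.
    batchDF.explain()
  }
  .start()
{code}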






[jira] [Assigned] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39748:


Assignee: Apache Spark

> Include the origin logical plan for LogicalRDD if it comes from DataFrame
> -
>
> Key: SPARK-39748
> URL: https://issues.apache.org/jira/browse/SPARK-39748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> When Spark converts the DataFrame to LogicalRDD for some reason (e.g. 
> foreachBatch sink), Spark just picks the RDD from the origin DataFrame and 
> discards the (logical/physical) plan.
> The origin logical plan can be useful for several use cases, including:
> 1. connecting the overall logical plan into one
> 2. inheriting plan stats from the origin logical plan






[jira] [Commented] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565236#comment-17565236
 ] 

Apache Spark commented on SPARK-39748:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37161

> Include the origin logical plan for LogicalRDD if it comes from DataFrame
> -
>
> Key: SPARK-39748
> URL: https://issues.apache.org/jira/browse/SPARK-39748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> When Spark converts the DataFrame to LogicalRDD for some reason (e.g. 
> foreachBatch sink), Spark just picks the RDD from the origin DataFrame and 
> discards the (logical/physical) plan.
> The origin logical plan can be useful for several use cases, including:
> 1. connecting the overall logical plan into one
> 2. inheriting plan stats from the origin logical plan






[jira] [Commented] (SPARK-39749) Use plain string representation on casting Decimal to String

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565235#comment-17565235
 ] 

Apache Spark commented on SPARK-39749:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37160

> Use plain string representation on casting Decimal to String
> 
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, casting a decimal to string type results in strings with 
> exponential notation if the adjusted exponent is less than -6. This is 
> consistent with BigDecimal.toString 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This is different from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.






[jira] [Assigned] (SPARK-39749) Use plain string representation on casting Decimal to String

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39749:


Assignee: Apache Spark  (was: Gengliang Wang)

> Use plain string representation on casting Decimal to String
> 
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, casting a decimal to string type results in strings with 
> exponential notation if the adjusted exponent is less than -6. This is 
> consistent with BigDecimal.toString 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This is different from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.






[jira] [Commented] (SPARK-39749) Use plain string representation on casting Decimal to String

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565234#comment-17565234
 ] 

Apache Spark commented on SPARK-39749:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37160

> Use plain string representation on casting Decimal to String
> 
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, casting a decimal to string type results in strings with 
> exponential notation if the adjusted exponent is less than -6. This is 
> consistent with BigDecimal.toString 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This is different from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.






[jira] [Assigned] (SPARK-39749) Use plain string representation on casting Decimal to String

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39749:


Assignee: Gengliang Wang  (was: Apache Spark)

> Use plain string representation on casting Decimal to String
> 
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, casting a decimal to string type results in strings with 
> exponential notation if the adjusted exponent is less than -6. This is 
> consistent with BigDecimal.toString 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This is different from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.






[jira] [Resolved] (SPARK-39711) Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging

2022-07-11 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-39711.

Fix Version/s: 3.4.0
 Assignee: BingKun Pan
   Resolution: Fixed

> Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
> 
>
> Key: SPARK-39711
> URL: https://issues.apache.org/jira/browse/SPARK-39711
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> SparkFunSuite is declared as follows:
> {code:java}
> abstract class SparkFunSuite
> extends AnyFunSuite
> with BeforeAndAfterAll
> with BeforeAndAfterEach
> with ThreadAudit
> with Logging
> {code}
> Some suites extend SparkFunSuite and also mix in BeforeAndAfterAll, 
> BeforeAndAfterEach, or Logging, which is redundant.
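
For illustration, inside Spark's own test tree the cleanup amounts to dropping the extra mix-ins (the suite names below are hypothetical):

{code:java}
// Redundant: BeforeAndAfterAll and Logging are already mixed into SparkFunSuite.
class MySuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
  test("addition") { assert(1 + 1 == 2) }
}

// Equivalent after the cleanup proposed here.
class MyCleanedSuite extends SparkFunSuite {
  test("addition") { assert(1 + 1 == 2) }
}
{code}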






[jira] [Created] (SPARK-39749) Use plain string representation on casting Decimal to String

2022-07-11 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-39749:
--

 Summary: Use plain string representation on casting Decimal to 
String
 Key: SPARK-39749
 URL: https://issues.apache.org/jira/browse/SPARK-39749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Currently, casting a decimal to string type results in strings with 
exponential notation if the adjusted exponent is less than -6. This is 
consistent with BigDecimal.toString 
[https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString] 

This is different from external databases like PostgreSQL/Oracle/MS SQL Server, 
and it is not compliant with the ANSI SQL standard either. 

I suggest always using the plain string representation in the cast.






[jira] [Resolved] (SPARK-39737) PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter

2022-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39737.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37150
[https://github.com/apache/spark/pull/37150]

> PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter 
> 
>
> Key: SPARK-39737
> URL: https://issues.apache.org/jira/browse/SPARK-39737
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, Spark supports the ANSI aggregate functions percentile_cont and 
> percentile_disc.
> However, the two aggregate functions do not support the aggregate FILTER clause.
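
As an example of the kind of query this change enables (table and column names are illustrative), combining the ANSI WITHIN GROUP syntax with an aggregate FILTER clause:

{code:java}
spark.sql("""
  SELECT
    percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) FILTER (WHERE dept = 'eng') AS median_eng,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) FILTER (WHERE dept = 'eng') AS median_disc_eng
  FROM employees
""").show()
{code}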






[jira] [Assigned] (SPARK-39737) PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter

2022-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39737:
---

Assignee: jiaan.geng

> PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter 
> 
>
> Key: SPARK-39737
> URL: https://issues.apache.org/jira/browse/SPARK-39737
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, Spark supports the ANSI aggregate functions percentile_cont and 
> percentile_disc.
> However, the two aggregate functions do not support the aggregate FILTER clause.






[jira] [Created] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-39748:


 Summary: Include the origin logical plan for LogicalRDD if it 
comes from DataFrame
 Key: SPARK-39748
 URL: https://issues.apache.org/jira/browse/SPARK-39748
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


When Spark converts the DataFrame to LogicalRDD for some reason (e.g. 
foreachBatch sink), Spark just picks the RDD from the origin DataFrame and 
discards the (logical/physical) plan.

The origin logical plan can be useful for several use cases, including:

1. connecting the overall logical plan into one
2. inheriting plan stats from the origin logical plan







[jira] [Updated] (SPARK-39741) Support url encode/decode as built-in function

2022-07-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-39741:

Fix Version/s: (was: 3.4.0)

> Support url encode/decode as built-in function
> --
>
> Key: SPARK-39741
> URL: https://issues.apache.org/jira/browse/SPARK-39741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Minor
>
> Currently, Spark doesn't support URL encode/decode as built-in functions; 
> users might use reflect instead, which is a bit of a hassle, and these 
> functions are often useful.
> This PR aims to add URL encode/decode built-in function support.
>  
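
The reflect-based workaround mentioned above looks roughly like this (literal input used for illustration):

{code:java}
// Call the static java.net.URLEncoder.encode(String, String) via the built-in
// reflect() SQL function, since there is no dedicated built-in yet.
spark.sql(
  "SELECT reflect('java.net.URLEncoder', 'encode', 'Spark SQL rocks', 'UTF-8') AS encoded"
).show(truncate = false)   // prints Spark+SQL+rocks
{code}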






[jira] [Updated] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-39723:
--
Summary: Implement functionExists/getFunc in SparkR for 3L namespace  (was: 
Implement functionExists/getFunction in SparkR for 3L namespace)

> Implement functionExists/getFunc in SparkR for 3L namespace
> ---
>
> Key: SPARK-39723
> URL: https://issues.apache.org/jira/browse/SPARK-39723
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39736:


Assignee: Yikun Jiang

> Enable base image build in SparkR job
> -
>
> Key: SPARK-39736
> URL: https://issues.apache.org/jira/browse/SPARK-39736
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>







[jira] [Resolved] (SPARK-39736) Enable base image build in SparkR job

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39736.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37158
[https://github.com/apache/spark/pull/37158]

> Enable base image build in SparkR job
> -
>
> Key: SPARK-39736
> URL: https://issues.apache.org/jira/browse/SPARK-39736
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-39747) pandas and pandas on Spark API parameter naming difference

2022-07-11 Thread Chenyang Zhang (Jira)
Chenyang Zhang created SPARK-39747:
--

 Summary: pandas and pandas on Spark API parameter naming 
difference 
 Key: SPARK-39747
 URL: https://issues.apache.org/jira/browse/SPARK-39747
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark
Affects Versions: 3.3.0
Reporter: Chenyang Zhang


I noticed there are some parameter naming differences between pandas and pandas 
on Spark. For example, in "read_csv", the path parameter is 
"filepath_or_buffer" for pandas and "path" for pandas on Spark. I wonder why 
such a difference exists and may I ask to change it to match exactly the same 
in pandas. 






[jira] [Updated] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration

2022-07-11 Thread Stephen Mcmullan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Mcmullan updated SPARK-39696:
-
   Fix Version/s: (was: 3.3.1)
Target Version/s:   (was: 3.3.0)

> Uncaught exception in thread executor-heartbeater 
> java.util.ConcurrentModificationException: mutation occurred during iteration
> ---
>
> Key: SPARK-39696
> URL: https://issues.apache.org/jira/browse/SPARK-39696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 
> distribution)
> Scala 2.13.8 / OpenJDK 17.0.3 application compilation
> Alpine Linux 3.14.3
> JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12
>Reporter: Stephen Mcmullan
>Priority: Major
>
> {noformat}
> 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] 
> org.apache.spark.util.Utils - Uncaught exception in thread 
> executor-heartbeater
> java.util.ConcurrentModificationException: mutation occurred during iteration
> at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat$(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractSeq.concat(Seq.scala:1161) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042)
>  ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>  ~[?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  ~[?:?]
> at 
> ja

[jira] [Created] (SPARK-39746) Binary array operations can be faster if one side is a constant

2022-07-11 Thread David Vogelbacher (Jira)
David Vogelbacher created SPARK-39746:
-

 Summary: Binary array operations can be faster if one side is a 
constant
 Key: SPARK-39746
 URL: https://issues.apache.org/jira/browse/SPARK-39746
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: David Vogelbacher


Array operations such as 
[ArraysOverlap|https://github.com/apache/spark/blob/79f133b7bbc1d9aa6a20dd8a34ec120902f96155/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1367]
 are optimized to put all the elements of the smaller array into a HashSet, if 
elements properly support equals. 
However, if one of the arrays is a constant, we could do much better: we don't 
have to reconstruct the HashSet for each row, but could construct it just once 
and send it to all the executors. This would improve runtime by a constant 
factor.
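
The shape of query this is about looks like the following (table and column names are illustrative); the literal array on the right-hand side is the side whose lookup structure could be built once instead of per row:

{code:java}
spark.sql(
  "SELECT id FROM events WHERE arrays_overlap(tags, array('spark', 'sql', 'shuffle'))"
).show()
{code}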






[jira] [Created] (SPARK-39745) Accept a list that contains NumPy scalars in `createDataFrame`

2022-07-11 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39745:


 Summary: Accept a list that contains NumPy scalars in 
`createDataFrame`
 Key: SPARK-39745
 URL: https://issues.apache.org/jira/browse/SPARK-39745
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Currently, only lists of native Python scalars are accepted in 
`createDataFrame`.

We should support NumPy scalars as well.






[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
Currently, DataFrame creation from a list of native Python scalars is 
unsupported in PySpark, for example,

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).

  was:
 

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).


> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: }}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).






[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
 

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).

  was:
{{Currently, DataFrame creation from a list of native Python scalars is 
unsupported in PySpark, for example,}}

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).


> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: }}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).






[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
{{Currently, DataFrame creation from a list of native Python scalars is 
unsupported in PySpark, for example,}}

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).

  was:
{{Currently, DataFrame creation from a list of scalars is unsupported in 
PySpark, for example,}}

{{>>> spark.createDataFrame([1, 2]).collect()}}
{{Traceback (most recent call last):}}
{{...}}
{{TypeError: Can not infer schema for type: }}

{{However, Spark DataFrame Scala API supports that:}}

{{scala> Seq(1, 2).toDF().collect()}}
{{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. 

See more 
[here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).


> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {{Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,}}
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: }}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).






[jira] [Commented] (SPARK-38796) Implement the to_number and try_to_number SQL functions according to a new specification

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565088#comment-17565088
 ] 

Apache Spark commented on SPARK-38796:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37159

> Implement the to_number and try_to_number SQL functions according to a new 
> specification
> 
>
> Key: SPARK-38796
> URL: https://issues.apache.org/jira/browse/SPARK-38796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.3.0
>
>
> This tracks implementing the 'to_number' and 'try_to_number' SQL function 
> expressions according to new semantics described below. The former is 
> equivalent to the latter except that it throws an exception instead of 
> returning NULL for cases where the input string does not match the format 
> string.
>  
> ---
>  
> *try_to_number function (expr, fmt):*
> Returns 'expr' cast to DECIMAL using formatting 'fmt', or 'NULL' if 'expr' is 
> not a valid match for the given format.
>  
> Syntax: 
> [ S ] [ L | $ ]
> [ 0 | 9 | G | , ] [...]
> [ . | D ] 
> [ 0 | 9 ] [...]       
> [ L | $ ] [ PR | MI | S ]
>  
> *Arguments:*
> 'expr': A STRING expression representing a number. 'expr' may include leading 
> or trailing spaces.
> 'fmt': An STRING literal, specifying the expected format of 'expr'.
>  
> *Returns:*
> A DECIMAL(p, s) where 'p' is the total number of digits ('0' or '9') and 's' 
> is the number of digits after the decimal point, or 0 if there is none.
>  
> *Format elements allowed (case insensitive):*
>  * 0 or 9
>   Specifies an expected digit between '0' and '9'. 
>   A '0' to the left of the decimal point indicates that 'expr' must have at 
> least as many digits. A leading '9' indicates that 'expr' may omit these 
> digits.
>   'expr' must not be larger than the number of digits to the left of the 
> decimal point allowed by the format string.
>   Digits to the right of the decimal point in the format string indicate the 
> most digits that 'expr' may have to the right of the decimal point.
>  * . or D
>   Specifies the position of the decimal point.
>   'expr' does not need to include a decimal point.
>  * , or G
>   Specifies the position of the ',' grouping (thousands) separator.
>   There must be a '0' or '9' to the left of the rightmost grouping separator. 
>   'expr' must match the grouping separator relevant for the size of the 
> number. 
>  * $
>   Specifies the location of the '$' currency sign. This character may only be 
> specified once.
>  * S 
>   Specifies the position of an optional '+' or '-' sign. This character may 
> only be specified once.
>  * MI
>   Specifies that 'expr' has an optional '-' sign at the end, but no '+'.
>  * PR
>   Specifies that 'expr' indicates a negative number with wrapping angled 
> brackets ('<1>'). If 'expr' contains any characters other than '0' through 
> '9' and those permitted in 'fmt', a 'NULL' is returned.
>  
> *Examples:*
> {{– The format expects:}}
> {{–  * an optional sign at the beginning,}}
> {{–  * followed by a dollar sign,}}
> {{–  * followed by a number between 3 and 6 digits long,}}
> {{–  * thousands separators,}}
> {{–  * up to two digits beyond the decimal point. }}
> {{> SELECT try_to_number('-$12,345.67', 'S$999,099.99');}}
> {{ -12345.67}}
> {{– The plus sign is optional, and so are fractional digits.}}
> {{> SELECT try_to_number('$345', 'S$999,099.99');}}
> {{ 345.00}}
> {{– The format requires at least three digits.}}
> {{> SELECT try_to_number('$45', 'S$999,099.99');}}
> {{ NULL}}
> {{– The format requires at least three digits.}}
> {{> SELECT try_to_number('$045', 'S$999,099.99');}}
> {{ 45.00}}
> {{– Using brackets to denote negative values}}
> {{> SELECT try_to_number('<1234>', '99PR');}}
> {{ -1234}}
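
For illustration, a minimal Scala sketch of the difference described above, assuming a 
local SparkSession and a Spark build that already ships both functions; the exact error 
raised by to_number is not shown because it depends on the implementation:

{code:java}
import org.apache.spark.sql.SparkSession

// Minimal sketch: try_to_number returns NULL for a non-matching input, while
// to_number is expected to throw for the same input.
val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Too few digits for the '0' in the format: yields NULL rather than an error.
spark.sql("SELECT try_to_number('$45', 'S$999,099.99') AS v").show()

// The same input with to_number is expected to fail at runtime:
// spark.sql("SELECT to_number('$45', 'S$999,099.99')").show()
{code}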



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38999) Refactor DataSourceScanExec code to

2022-07-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38999:
--
Target Version/s:   (was: 3.3.0, 3.2.2, 3.4.0)

> Refactor DataSourceScanExec code to 
> 
>
> Key: SPARK-38999
> URL: https://issues.apache.org/jira/browse/SPARK-38999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently the code for the `FileSourceScanExec` class, the physical node for 
> file scans, is quite complex and lengthy. The class should be refactored into 
> a trait `FileSourceScanLike` which implements basic functionality like 
> metrics and file listing. The execution-specific code can then live inside 
> `FileSourceScanExec`, which will subclass `FileSourceScanLike`.
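
As a rough sketch of the proposed shape (the member names below are illustrative 
assumptions, not the actual Spark API):

{code:java}
// Illustrative sketch only: the real trait works with Spark's SQLMetric and
// partition/file listing types rather than these simplified members.
trait FileSourceScanLike {
  // execution-independent concerns shared by file-based scans
  def metrics: Map[String, Long]
  def listFiles(): Seq[String]
}

case class FileSourceScanExec(relationPath: String) extends FileSourceScanLike {
  override def metrics: Map[String, Long] = Map("numFiles" -> 0L)
  override def listFiles(): Seq[String] = Seq.empty
  // execution-specific logic stays in the concrete physical node
  def doExecute(): Unit = ()
}
{code}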



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server

2022-07-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565057#comment-17565057
 ] 

Dongjoon Hyun commented on SPARK-38862:
---

I removed the invalid versions from the `Affected Versions` and `Target Versions` 
fields.

> Basic Authentication or Token Based Authentication for The REST Submission 
> Server
> -
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, rest, spark, spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> I appreciate that one could implement a custom proxy to provide this 
> authentication check, but it seems like a common use case that others may 
> benefit from to be able to authenticate against the rest submission endpoint, 
> and by implementing this capability as an optionally configurable aspect of 
> Spark itself, we can utilise the existing server to provide this check.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide, so that 
> if an admin needs a more advanced authentication check, such as RBAC, it can 
> be supported without writing a complete custom proxy layer; but I do feel 
> there should be some basic built-in available, e.g. 
> RestSubmissionBasicAuthenticator.
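
For what such a check could look like, a minimal sketch; the object name, header 
handling and credential source here are assumptions, not an existing Spark API:

{code:java}
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical helper: validates an HTTP "Authorization: Basic ..." header
// against a single configured username/password pair.
object RestSubmissionBasicAuthCheck {
  def authorized(header: Option[String], user: String, password: String): Boolean =
    header.exists { value =>
      value.startsWith("Basic ") && {
        val decoded = new String(
          Base64.getDecoder.decode(value.stripPrefix("Basic ")),
          StandardCharsets.UTF_8)
        decoded == s"$user:$password"
      }
    }
}
{code}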



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server

2022-07-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38862:
--
Target Version/s:   (was: 3.3.0, 3.2.2)

> Basic Authentication or Token Based Authentication for The REST Submission 
> Server
> -
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Jack
>Priority: Major
>  Labels: authentication, rest, spark, spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> I appreciate that one could implement a custom proxy to provide this 
> authentication check, but it seems like a common use case that others may 
> benefit from to be able to authenticate against the rest submission endpoint, 
> and by implementing this capability as an optionally configurable aspect of 
> Spark itself, we can utilise the existing server to provide this check.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide, so that 
> if an admin needs a more advanced authentication check, such as RBAC, it can 
> be supported without writing a complete custom proxy layer; but I do feel 
> there should be some basic built-in available, e.g. 
> RestSubmissionBasicAuthenticator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server

2022-07-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38862:
--
Affects Version/s: 3.4.0
   (was: 3.0.0)
   (was: 3.1.0)
   (was: 3.0.1)
   (was: 3.0.2)
   (was: 3.2.0)
   (was: 3.1.1)
   (was: 3.1.2)
   (was: 3.0.3)
   (was: 3.2.1)

> Basic Authentication or Token Based Authentication for The REST Submission 
> Server
> -
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, rest, spark, spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> I appreciate that one could implement a custom proxy to provide this 
> authentication check, but it seems like a common use case that others may 
> benefit from to be able to authenticate against the rest submission endpoint, 
> and by implementing this capability as an optionally configurable aspect of 
> Spark itself, we can utilise the existing server to provide this check.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide, so that 
> if an admin needs a more advanced authentication check, such as RBAC, it can 
> be supported without writing a complete custom proxy layer; but I do feel 
> there should be some basic built-in available, e.g. 
> RestSubmissionBasicAuthenticator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39736) Enable base image build in SparkR job

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565052#comment-17565052
 ] 

Apache Spark commented on SPARK-39736:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37158

> Enable base image build in SparkR job
> -
>
> Key: SPARK-39736
> URL: https://issues.apache.org/jira/browse/SPARK-39736
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39736:


Assignee: Apache Spark

> Enable base image build in SparkR job
> -
>
> Key: SPARK-39736
> URL: https://issues.apache.org/jira/browse/SPARK-39736
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39736:


Assignee: (was: Apache Spark)

> Enable base image build in SparkR job
> -
>
> Key: SPARK-39736
> URL: https://issues.apache.org/jira/browse/SPARK-39736
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39667) Add another workaround when there is not enough memory to build and broadcast the table

2022-07-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-39667.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37069
[https://github.com/apache/spark/pull/37069]

> Add another workaround when there is not enough memory to build and broadcast 
> the table
> ---
>
> Key: SPARK-39667
> URL: https://issues.apache.org/jira/browse/SPARK-39667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39667) Add another workaround when there is not enough memory to build and broadcast the table

2022-07-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-39667:
---

Assignee: Yuming Wang

> Add another workaround when there is not enough memory to build and broadcast 
> the table
> ---
>
> Key: SPARK-39667
> URL: https://issues.apache.org/jira/browse/SPARK-39667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564934#comment-17564934
 ] 

Apache Spark commented on SPARK-39744:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37154

> Add the REGEXP_INSTR function
> -
>
> Key: SPARK-39744
> URL: https://issues.apache.org/jira/browse/SPARK-39744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The function should return the position of the specified occurrence of the 
> regular expression pattern in the input string. If no match is found, returns 
> 0. See other DBMSs:
>  - MariaDB: [https://mariadb.com/kb/en/regexp_instr/]
>  - Oracle: 
> [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm]
>  - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr]
>  - Snowflake: 
> [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html]
>  - BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr]
>  - Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html]
>  - Exasol DB: 
> [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm]
> - Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm]
>  
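
For reference, a minimal Scala sketch of the described semantics (1-based position of 
the n-th match, 0 when there is none); this is an illustration only, not the proposed 
implementation:

{code:java}
import java.util.regex.Pattern

// Illustrative only: returns the 1-based start position of the n-th match of
// `pattern` in `input`, or 0 if there is no such match.
def regexpInstr(input: String, pattern: String, occurrence: Int = 1): Int = {
  val matcher = Pattern.compile(pattern).matcher(input)
  var found = 0
  while (matcher.find()) {
    found += 1
    if (found == occurrence) return matcher.start() + 1
  }
  0
}

// regexpInstr("spark-39744", "[0-9]+")  // returns 7
{code}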



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39744:


Assignee: Max Gekk  (was: Apache Spark)

> Add the REGEXP_INSTR function
> -
>
> Key: SPARK-39744
> URL: https://issues.apache.org/jira/browse/SPARK-39744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The function should return the position of the specified occurrence of the 
> regular expression pattern in the input string. If no match is found, returns 
> 0. See other DBMSs:
>  - MariaDB: [https://mariadb.com/kb/en/regexp_instr/]
>  - Oracle: 
> [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm]
>  - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr]
>  - Snowflake: 
> [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html]
>  - BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr]
>  - Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html]
>  - Exasol DB: 
> [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm]
> - Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39744:


Assignee: Apache Spark  (was: Max Gekk)

> Add the REGEXP_INSTR function
> -
>
> Key: SPARK-39744
> URL: https://issues.apache.org/jira/browse/SPARK-39744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The function should return the position of the specified occurrence of the 
> regular expression pattern in the input string. If no match is found, returns 
> 0. See other DBMSs:
>  - MariaDB: [https://mariadb.com/kb/en/regexp_instr/]
>  - Oracle: 
> [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm]
>  - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr]
>  - Snowflake: 
> [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html]
>  - BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr]
>  - Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html]
>  - Exasol DB: 
> [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm]
> - Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39744:
-
Description: 
The function should return the position of the specified occurrence of the 
regular expression pattern in the input string. If no match is found, returns 
0. See other DBMSs:
 - MariaDB: [https://mariadb.com/kb/en/regexp_instr/]
 - Oracle: 
[https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm]
 - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr]
 - Snowflake: 
[https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html]
 - BigQuery: 
[https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr]
 - Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html]
 - Exasol DB: 
[https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm]
- Vertica: 
[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm]
 

  was:
The function should search a string for a regular expression pattern and 
return it, or NULL if it is not found. See other DBMSs:
- Oracle: 
https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
- DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr
- Snowflake: 
https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
- BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr
- Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
- MariaDB: https://mariadb.com/kb/en/regexp_substr/
- Exasol DB: 
https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm
 


> Add the REGEXP_INSTR function
> -
>
> Key: SPARK-39744
> URL: https://issues.apache.org/jira/browse/SPARK-39744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The function should return the position of the specified occurrence of the 
> regular expression pattern in the input string. If no match is found, returns 
> 0. See other DBMSs:
>  - MariaDB: [https://mariadb.com/kb/en/regexp_instr/]
>  - Oracle: 
> [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm]
>  - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr]
>  - Snowflake: 
> [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html]
>  - BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr]
>  - Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html]
>  - Exasol DB: 
> [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm]
> - Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564930#comment-17564930
 ] 

Apache Spark commented on SPARK-39742:
--

User 'zml1206' has created a pull request for this issue:
https://github.com/apache/spark/pull/37156

> Request executor after kill executor, the number of executors is not as 
> expected
> 
>
> Key: SPARK-39742
> URL: https://issues.apache.org/jira/browse/SPARK-39742
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.2.1
>Reporter: zhuml
>Priority: Major
>
> I used the killExecutors and requestExecutors functions of SparkContext to 
> dynamically adjust resources, and found that requestExecutors called after 
> killExecutors did not achieve the expected result.
> Unit test added in StandaloneDynamicAllocationSuite.scala: 
> {code:java}
> test("kill executors first and then request") {
>     sc = new SparkContext(appConf
>       .set(config.EXECUTOR_CORES, 2)
>       .set(config.CORES_MAX, 8))
>     val appId = sc.applicationId
>     eventually(timeout(10.seconds), interval(10.millis)) {
>       val apps = getApplications()
>       assert(apps.size === 1)
>       assert(apps.head.id === appId)
>       assert(apps.head.executors.size === 4) // 8 cores total
>       assert(apps.head.getExecutorLimit === Int.MaxValue)
>     }
>     // sync executors between the Master and the driver, needed because
>     // the driver refuses to kill executors it does not know about
>     syncExecutors(sc)
>     val executors = getExecutorIds(sc)
>     assert(executors.size === 4)
>     // kill 3 executors
>     assert(sc.killExecutors(executors.take(3)))
>     val apps = getApplications()
>     assert(apps.head.executors.size === 1)
>     // request 3 additional executors
>     assert(sc.requestExecutors(3))
>     assert(apps.head.executors.size === 4)
>   } {code}
> 3 did not equal 4
> Expected :4
> Actual   :3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35242) Support change catalog default database for spark

2022-07-11 Thread Gabor Roczei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564931#comment-17564931
 ] 

Gabor Roczei commented on SPARK-35242:
--

Ok, thanks [~hongdongdong]!

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot 
> access 'default', we get a 'Permission denied' exception. We should support 
> changing the default database for the catalog, like 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39742:


Assignee: (was: Apache Spark)

> Request executor after kill executor, the number of executors is not as 
> expected
> 
>
> Key: SPARK-39742
> URL: https://issues.apache.org/jira/browse/SPARK-39742
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.2.1
>Reporter: zhuml
>Priority: Major
>
> I used the killExecutors and requestExecutors functions of SparkContext to 
> dynamically adjust resources, and found that requestExecutors called after 
> killExecutors did not achieve the expected result.
> Unit test added in StandaloneDynamicAllocationSuite.scala: 
> {code:java}
> test("kill executors first and then request") {
>     sc = new SparkContext(appConf
>       .set(config.EXECUTOR_CORES, 2)
>       .set(config.CORES_MAX, 8))
>     val appId = sc.applicationId
>     eventually(timeout(10.seconds), interval(10.millis)) {
>       val apps = getApplications()
>       assert(apps.size === 1)
>       assert(apps.head.id === appId)
>       assert(apps.head.executors.size === 4) // 8 cores total
>       assert(apps.head.getExecutorLimit === Int.MaxValue)
>     }
>     // sync executors between the Master and the driver, needed because
>     // the driver refuses to kill executors it does not know about
>     syncExecutors(sc)
>     val executors = getExecutorIds(sc)
>     assert(executors.size === 4)
>     // kill 3 executors
>     assert(sc.killExecutors(executors.take(3)))
>     val apps = getApplications()
>     assert(apps.head.executors.size === 1)
>     // request 3 additional executors
>     assert(sc.requestExecutors(3))
>     assert(apps.head.executors.size === 4)
>   } {code}
> 3 did not equal 4
> Expected :4
> Actual   :3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39742:


Assignee: Apache Spark

> Request executor after kill executor, the number of executors is not as 
> expected
> 
>
> Key: SPARK-39742
> URL: https://issues.apache.org/jira/browse/SPARK-39742
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.2.1
>Reporter: zhuml
>Assignee: Apache Spark
>Priority: Major
>
> I used the killExecutors and requestExecutors functions of SparkContext to 
> dynamically adjust resources, and found that requestExecutors called after 
> killExecutors did not achieve the expected result.
> Unit test added in StandaloneDynamicAllocationSuite.scala: 
> {code:java}
> test("kill executors first and then request") {
>     sc = new SparkContext(appConf
>       .set(config.EXECUTOR_CORES, 2)
>       .set(config.CORES_MAX, 8))
>     val appId = sc.applicationId
>     eventually(timeout(10.seconds), interval(10.millis)) {
>       val apps = getApplications()
>       assert(apps.size === 1)
>       assert(apps.head.id === appId)
>       assert(apps.head.executors.size === 4) // 8 cores total
>       assert(apps.head.getExecutorLimit === Int.MaxValue)
>     }
>     // sync executors between the Master and the driver, needed because
>     // the driver refuses to kill executors it does not know about
>     syncExecutors(sc)
>     val executors = getExecutorIds(sc)
>     assert(executors.size === 4)
>     // kill 3 executors
>     assert(sc.killExecutors(executors.take(3)))
>     val apps = getApplications()
>     assert(apps.head.executors.size === 1)
>     // request 3 additional executors
>     assert(sc.requestExecutors(3))
>     assert(apps.head.executors.size === 4)
>   } {code}
> 3 did not equal 4
> Expected :4
> Actual   :3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Max Gekk (Jira)
Max Gekk created SPARK-39744:


 Summary: Add the REGEXP_INSTR function
 Key: SPARK-39744
 URL: https://issues.apache.org/jira/browse/SPARK-39744
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0


The function should search a string for a regular expression pattern and 
return it, or NULL if it is not found. See other DBMSs:
- Oracle: 
https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
- DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr
- Snowflake: 
https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
- BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr
- Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
- MariaDB: https://mariadb.com/kb/en/regexp_substr/
- Exasol DB: 
https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39744) Add the REGEXP_INSTR function

2022-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39744:
-
Fix Version/s: (was: 3.4.0)

> Add the REGEXP_INSTR function
> -
>
> Key: SPARK-39744
> URL: https://issues.apache.org/jira/browse/SPARK-39744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The function should search a string for a regular expression pattern and 
> return it, or NULL if it is not found. See other DBMSs:
> - Oracle: 
> https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
> - DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr
> - Snowflake: 
> https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
> - BigQuery: 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr
> - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
> - MariaDB: https://mariadb.com/kb/en/regexp_substr/
> - Exasol DB: 
> https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39735) Enable base image build in lint job and fix sparkr env

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564916#comment-17564916
 ] 

Apache Spark commented on SPARK-39735:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37155

> Enable base image build in lint job and fix sparkr env
> --
>
> Key: SPARK-39735
> URL: https://issues.apache.org/jira/browse/SPARK-39735
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> Since R 4.2.x has the [following new 
> change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376):
>   > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the 
> R system default if unset or empty, and can be set to NULL to indicate an 
> empty list of user or site library directories.
> the latest Ubuntu package also syncs this change and has some changes in 
> /etc/R/Renviron:
> ```
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron | grep R_LIBS_SITE
> R_LIBS_SITE=${R_LIBS_SITE:-'%S'}
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron.site | grep R_LIBS_SITE
> ## edd Jul 2007  Now use R_LIBS_SITE, not R_LIBS
> ## edd Mar 2022  Now in Renviron.site reflecting R_LIBS_SITE
> R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library"
> ```
> So, we add `R_LIBS_SITE` to the ENV from `/etc/R/Renviron.site` to make sure 
> the search paths are right for SparkR.
> Otherwise, even if we install `lintr`, we get errors like the following 
> because `R_LIBS_SITE` is set incorrectly:
> ```
> $ dev/lint-r
> Loading required namespace: SparkR
> Loading required namespace: lintr
> Failed with error:  'there is no package called 'lintr''
> Installing package into '/usr/lib/R/site-library'
> (as 'lib' is unspecified)
> Error in contrib.url(repos, type) :
>   trying to use CRAN without setting a mirror
> Calls: install.packages -> startsWith -> contrib.url
> Execution halted
> ```
> [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
> [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39739) Upgrade sbt to 1.7.0

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39739.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37151
[https://github.com/apache/spark/pull/37151]

> Upgrade sbt to 1.7.0
> 
>
> Key: SPARK-39739
> URL: https://issues.apache.org/jira/browse/SPARK-39739
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> https://eed3si9n.com/sbt-1.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39739) Upgrade sbt to 1.7.0

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39739:


Assignee: Yang Jie

> Upgrade sbt to 1.7.0
> 
>
> Key: SPARK-39739
> URL: https://issues.apache.org/jira/browse/SPARK-39739
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> https://eed3si9n.com/sbt-1.7.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-11 Thread Yeachan Park (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeachan Park updated SPARK-39743:
-
Summary: Unable to set zstd compression level while writing parquet files  
(was: Unable to set zstd compression level while writing parquet)

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd cli tool, we confirmed that setting a higher compression level 
> for the same file tested in spark resulted in a smaller file.
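
For context, a minimal reproduction sketch of the reported behaviour; the output path 
and level value are arbitrary:

{code:java}
import org.apache.spark.sql.SparkSession

// Reproduction sketch: the session-level zstd setting below is reported to be
// ignored when writing parquet, which ends up at the default zstd level.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.io.compression.zstd.level", "19")  // reported to have no effect
  .getOrCreate()

spark.range(1000000L).toDF("id")
  .write
  .option("compression", "zstd")
  .mode("overwrite")
  .parquet("/tmp/zstd-level-test")  // illustrative output path
{code}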



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39743) Unable to set zstd compression level while writing parquet

2022-07-11 Thread Yeachan Park (Jira)
Yeachan Park created SPARK-39743:


 Summary: Unable to set zstd compression level while writing parquet
 Key: SPARK-39743
 URL: https://issues.apache.org/jira/browse/SPARK-39743
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Yeachan Park


While writing zstd-compressed parquet files, the setting 
`spark.io.compression.zstd.level` does not have any effect on the zstd 
compression level.

All files seem to be written with the default zstd compression level, and the 
config option seems to be ignored.

Using the zstd cli tool, we confirmed that setting a higher compression level 
for the same file tested in spark resulted in a smaller file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39619) PrometheusServlet: add "TYPE" comment to exposed metrics

2022-07-11 Thread Eric Barault (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564892#comment-17564892
 ] 

Eric Barault commented on SPARK-39619:
--

A PR was provided here: 
https://github.com/apache/spark/pull/37153

> PrometheusServlet: add "TYPE" comment to exposed metrics
> 
>
> Key: SPARK-39619
> URL: https://issues.apache.org/jira/browse/SPARK-39619
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Eric Barault
>Priority: Major
>
> The PrometheusServlet sink does not include the usual comments when exposing 
> the metrics in the prometheus format
> e.g. `# TYPE nginx_ingress_controller_ingress_upstream_latency_seconds 
> summary`
> which prevents some clients/integrations that rely on them to determine the 
> metric type from working properly.
> For example, the AWS CloudWatch agent Prometheus plugin attempts to get the 
> metric type from the TYPE comment and considers any metric with no type as 
> unsupported, and hence drops it. 
> As a result, the CloudWatch agent drops all the metrics exposed by 
> the PrometheusServlet.
> [https://github.com/aws/amazon-cloudwatch-agent/blob/1f654cf69c1269073673ba2f636738c556248a31/plugins/inputs/prometheus_scraper/metrics_type_handler.go#L190]
>  
> This would be solved by adding the TYPE comments to the metrics exposed by 
> the PrometheusServlet sink.
>  
> _*references:*_
>  - [https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/]
>  - 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala]
>  - 
> [https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md]
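
As a small illustration of the requested output, a sketch that prefixes a sample with 
its TYPE comment; the metric name below is hypothetical:

{code:java}
// Illustrative only: emit a "# TYPE <name> <type>" line before a metric sample,
// as the Prometheus exposition format expects.
def withTypeComment(name: String, promType: String, value: Double): String =
  s"# TYPE $name $promType\n$name $value"

// withTypeComment("spark_driver_jvm_heap_used_bytes", "gauge", 1.23e8) produces:
// # TYPE spark_driver_jvm_heap_used_bytes gauge
// spark_driver_jvm_heap_used_bytes 1.23E8
{code}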



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39721) Deprecate databaseName in listColumns if needed

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-39721:
--
Summary: Deprecate databaseName in listColumns if needed  (was: Add 
getTable/getDatabase/getFunction in SparkR support 3L namespace)

> Deprecate databaseName in listColumns if needed
> ---
>
> Key: SPARK-39721
> URL: https://issues.apache.org/jira/browse/SPARK-39721
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39721) Deprecate databaseName in listColumns in SparkR if needed

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-39721:
--
Summary: Deprecate databaseName in listColumns in SparkR if needed  (was: 
Deprecate databaseName in listColumns if needed)

> Deprecate databaseName in listColumns in SparkR if needed
> -
>
> Key: SPARK-39721
> URL: https://issues.apache.org/jira/browse/SPARK-39721
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-39721) Deprecate databaseName in listColumns in SparkR if needed

2022-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reopened SPARK-39721:
---

> Deprecate databaseName in listColumns in SparkR if needed
> -
>
> Key: SPARK-39721
> URL: https://issues.apache.org/jira/browse/SPARK-39721
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected

2022-07-11 Thread zhuml (Jira)
zhuml created SPARK-39742:
-

 Summary: Request executor after kill executor, the number of 
executors is not as expected
 Key: SPARK-39742
 URL: https://issues.apache.org/jira/browse/SPARK-39742
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.2.1
Reporter: zhuml


I used the killExecutors and requestExecutors functions of SparkContext to 
dynamically adjust resources, and found that requestExecutors called after 
killExecutors did not achieve the expected result.

Unit test added in StandaloneDynamicAllocationSuite.scala: 
{code:java}
test("kill executors first and then request") {
    sc = new SparkContext(appConf
      .set(config.EXECUTOR_CORES, 2)
      .set(config.CORES_MAX, 8))
    val appId = sc.applicationId
    eventually(timeout(10.seconds), interval(10.millis)) {
      val apps = getApplications()
      assert(apps.size === 1)
      assert(apps.head.id === appId)
      assert(apps.head.executors.size === 4) // 8 cores total
      assert(apps.head.getExecutorLimit === Int.MaxValue)
    }
    // sync executors between the Master and the driver, needed because
    // the driver refuses to kill executors it does not know about
    syncExecutors(sc)
    val executors = getExecutorIds(sc)
    assert(executors.size === 4)
    // kill 3 executors
    assert(sc.killExecutors(executors.take(3)))
    val apps = getApplications()
    assert(apps.head.executors.size === 1)
    // request 3 additional executors
    assert(sc.requestExecutors(3))
    assert(apps.head.executors.size === 4)
  } {code}
3 did not equal 4
Expected :4
Actual   :3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564878#comment-17564878
 ] 

Apache Spark commented on SPARK-26052:
--

User 'danielhaviv' has created a pull request for this issue:
https://github.com/apache/spark/pull/37153

> Spark should output a _SUCCESS file for every partition correctly written
> -
>
> Key: SPARK-26052
> URL: https://issues.apache.org/jira/browse/SPARK-26052
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Matt Matolcsi
>Priority: Minor
>  Labels: bulk-closed
>
> When writing a set of partitioned Parquet files to HDFS using 
> dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table 
> after successful completion, though the actual Parquet files will end up in 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ 
> If partitions are written out one at a time (e.g., an hourly ETL), the 
> _SUCCESS file is overwritten by each subsequent run and information on what 
> partitions were correctly written is lost.
> I would like to be able to keep track of what partitions were successfully 
> written in HDFS. I think this could be done by writing the _SUCCESS files to 
> the same partition directories where the Parquet files reside, i.e., 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/
> Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I 
> don't think this should break partition discovery.
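
For illustration, a minimal sketch of doing this manually today with the Hadoop 
FileSystem API; it assumes a single level of partitioning and is a workaround sketch, 
not an existing Spark feature:

{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Workaround sketch: after a partitioned write, touch a _SUCCESS marker inside
// each first-level partition directory under `tableRoot`.
def markPartitions(spark: SparkSession, tableRoot: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(tableRoot))
    .filter(_.isDirectory)
    .foreach(part => fs.create(new Path(part.getPath, "_SUCCESS")).close())
}
{code}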



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564877#comment-17564877
 ] 

Apache Spark commented on SPARK-26052:
--

User 'danielhaviv' has created a pull request for this issue:
https://github.com/apache/spark/pull/37153

> Spark should output a _SUCCESS file for every partition correctly written
> -
>
> Key: SPARK-26052
> URL: https://issues.apache.org/jira/browse/SPARK-26052
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Matt Matolcsi
>Priority: Minor
>  Labels: bulk-closed
>
> When writing a set of partitioned Parquet files to HDFS using 
> dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table 
> after successful completion, though the actual Parquet files will end up in 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ 
> If partitions are written out one at a time (e.g., an hourly ETL), the 
> _SUCCESS file is overwritten by each subsequent run and information on what 
> partitions were correctly written is lost.
> I would like to be able to keep track of what partitions were successfully 
> written in HDFS. I think this could be done by writing the _SUCCESS files to 
> the same partition directories where the Parquet files reside, i.e., 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/
> Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I 
> don't think this should break partition discovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39741) Support url encode/decode as built-in function

2022-07-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564832#comment-17564832
 ] 

Apache Spark commented on SPARK-39741:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/37113

> Support url encode/decode as built-in function
> --
>
> Key: SPARK-39741
> URL: https://issues.apache.org/jira/browse/SPARK-39741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently, Spark does not support URL encode/decode as built-in functions; the 
> user has to fall back to reflect instead, which is a bit of a hassle, and these 
> functions are often useful.
> This PR aims to add URL encode/decode as built-in functions.
>  
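
For reference, a minimal sketch of the reflection workaround mentioned above, using the 
existing `reflect` SQL function to call java.net.URLEncoder; the session setup is shown 
only to make the snippet self-contained:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch of the current workaround: go through reflection instead of a
// dedicated built-in url_encode/url_decode function.
val spark = SparkSession.builder().master("local[1]").getOrCreate()

spark.sql(
  "SELECT reflect('java.net.URLEncoder', 'encode', 'a b&c', 'UTF-8') AS encoded"
).show()
// expected encoded value: a+b%26c
{code}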



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39741) Support url encode/decode as built-in function

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39741:


Assignee: (was: Apache Spark)

> Support url encode/decode as built-in function
> --
>
> Key: SPARK-39741
> URL: https://issues.apache.org/jira/browse/SPARK-39741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently, Spark does not support URL encode/decode as built-in functions; the 
> user has to fall back to reflect instead, which is a bit of a hassle, and these 
> functions are often useful.
> This PR aims to add URL encode/decode as built-in functions.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39741) Support url encode/decode as built-in function

2022-07-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39741:


Assignee: Apache Spark

> Support url encode/decode as built-in function
> --
>
> Key: SPARK-39741
> URL: https://issues.apache.org/jira/browse/SPARK-39741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently, Spark does not support URL encode/decode as built-in functions; the 
> user has to fall back to reflect instead, which is a bit of a hassle, and these 
> functions are often useful.
> This PR aims to add URL encode/decode as built-in functions.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39741) Support url encode/decode as built-in function

2022-07-11 Thread Yi kaifei (Jira)
Yi kaifei created SPARK-39741:
-

 Summary: Support url encode/decode as built-in function
 Key: SPARK-39741
 URL: https://issues.apache.org/jira/browse/SPARK-39741
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yi kaifei
 Fix For: 3.4.0


Currently, Spark does not support URL encode/decode as built-in functions; the 
user has to fall back to reflect instead, which is a bit of a hassle, and these 
functions are often useful.

This PR aims to add URL encode/decode as built-in functions.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39728:


Assignee: Andrew Ray

> Test for parity of SQL functions between Python and JVM DataFrame API's
> ---
>
> Key: SPARK-39728
> URL: https://issues.apache.org/jira/browse/SPARK-39728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Andrew Ray
>Assignee: Andrew Ray
>Priority: Minor
>
> Add a unit test that compares the available list of Python DataFrame 
> functions in pyspark.sql.functions with those available in the Scala/Java 
> DataFrame API in org.apache.spark.sql.functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39728.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37144
[https://github.com/apache/spark/pull/37144]

> Test for parity of SQL functions between Python and JVM DataFrame API's
> ---
>
> Key: SPARK-39728
> URL: https://issues.apache.org/jira/browse/SPARK-39728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Andrew Ray
>Assignee: Andrew Ray
>Priority: Minor
> Fix For: 3.4.0
>
>
> Add a unit test that compares the available list of Python DataFrame 
> functions in pyspark.sql.functions with those available in the Scala/Java 
> DataFrame API in org.apache.spark.sql.functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39735) Enable base image build in lint job and fix sparkr env

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39735.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37149
[https://github.com/apache/spark/pull/37149]

> Enable base image build in lint job and fix sparkr env
> --
>
> Key: SPARK-39735
> URL: https://issues.apache.org/jira/browse/SPARK-39735
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> Since R 4.2.x introduced the [following 
> change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376):
>   > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the 
>   > R system default if unset or empty, and can be set to NULL to indicate an 
>   > empty list of user or site library directories.
> the latest Ubuntu R package has synced this change and updated 
> /etc/R/Renviron and /etc/R/Renviron.site accordingly:
> ```
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron | grep R_LIBS_SITE
> R_LIBS_SITE=${R_LIBS_SITE:-'%S'}
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron.site | grep R_LIBS_SITE
> ## edd Jul 2007  Now use R_LIBS_SITE, not R_LIBS
> ## edd Mar 2022  Now in Renviron.site reflecting R_LIBS_SITE
> R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library"
> ```
> So we export `R_LIBS_SITE` (taking its value from `/etc/R/Renviron.site`) into 
> the image environment to make sure SparkR's library search paths are correct 
> (a minimal sketch of this idea follows below the description).
> Otherwise, even after installing `lintr`, linting fails because `R_LIBS_SITE` 
> is set incorrectly:
> ```
> $ dev/lint-r
> Loading required namespace: SparkR
> Loading required namespace: lintr
> Failed with error:  'there is no package called 'lintr''
> Installing package into '/usr/lib/R/site-library'
> (as 'lib' is unspecified)
> Error in contrib.url(repos, type) :
>   trying to use CRAN without setting a mirror
> Calls: install.packages -> startsWith -> contrib.url
> Execution halted
> ```
> [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
> [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html
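
Not the actual patch (which sets the equivalent ENV value in the CI image), but a minimal sketch of the idea described above: read `R_LIBS_SITE` from `/etc/R/Renviron.site` and export it so the R library search path includes the site library. The file path and value format are taken from the output quoted above; the script assumes that file exists.

```
# Sketch only: export R_LIBS_SITE from /etc/R/Renviron.site so SparkR/lintr
# resolve /usr/lib/R/site-library correctly.
import os
import re

def r_libs_site_from(renviron_site="/etc/R/Renviron.site"):
    # Look for a line like:
    # R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library"
    with open(renviron_site) as f:
        for line in f:
            match = re.match(r'^R_LIBS_SITE="?([^"\n]+)"?\s*$', line)
            if match:
                return match.group(1)
    return None

value = r_libs_site_from()
if value:
    os.environ["R_LIBS_SITE"] = value          # visible to child R processes
    print(f'export R_LIBS_SITE="{value}"')     # or emit for a CI step to eval
```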



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39735) Enable base image build in lint job and fix sparkr env

2022-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39735:


Assignee: Yikun Jiang

> Enable base image build in lint job and fix sparkr env
> --
>
> Key: SPARK-39735
> URL: https://issues.apache.org/jira/browse/SPARK-39735
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> Since R 4.2.x introduced the [following 
> change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376):
>   > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the 
>   > R system default if unset or empty, and can be set to NULL to indicate an 
>   > empty list of user or site library directories.
> the latest Ubuntu R package has synced this change and updated 
> /etc/R/Renviron and /etc/R/Renviron.site accordingly:
> ```
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron | grep R_LIBS_SITE
> R_LIBS_SITE=${R_LIBS_SITE:-'%S'}
> $ docker run -ti 
> ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat 
> /etc/R/Renviron.site | grep R_LIBS_SITE
> ## edd Jul 2007  Now use R_LIBS_SITE, not R_LIBS
> ## edd Mar 2022  Now in Renviron.site reflecting R_LIBS_SITE
> R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library"
> ```
> So we export `R_LIBS_SITE` (taking its value from `/etc/R/Renviron.site`) into 
> the image environment to make sure SparkR's library search paths are correct.
> Otherwise, even after installing `lintr`, linting fails because `R_LIBS_SITE` 
> is set incorrectly:
> ```
> $ dev/lint-r
> Loading required namespace: SparkR
> Loading required namespace: lintr
> Failed with error:  'there is no package called 'lintr''
> Installing package into '/usr/lib/R/site-library'
> (as 'lib' is unspecified)
> Error in contrib.url(repos, type) :
>   trying to use CRAN without setting a mirror
> Calls: install.packages -> startsWith -> contrib.url
> Execution halted
> ```
> [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
> [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39740) vis-timeline @ 4.2.1 vulnerable to XSS attacks

2022-07-11 Thread Eugene Shinn (Truveta) (Jira)
Eugene Shinn (Truveta) created SPARK-39740:
--

 Summary: vis-timeline @ 4.2.1 vulnerable to XSS attacks
 Key: SPARK-39740
 URL: https://issues.apache.org/jira/browse/SPARK-39740
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.3.0, 3.2.1
Reporter: Eugene Shinn (Truveta)


The Spark UI bundles the visjs/vis-timeline package at version 4.2.1, which is 
vulnerable to XSS attacks ([Cross-site Scripting in vis-timeline · CVE-2020-28487 
· GitHub Advisory Database|https://github.com/advisories/GHSA-9mrv-456v-pf22]). 
This version should be replaced with the first non-vulnerable release, [Release 
v7.4.4 · visjs/vis-timeline 
(github.com)|https://github.com/visjs/vis-timeline/releases/tag/v7.4.4], or 
higher.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org