[jira] [Commented] (SPARK-39750) Enable spark.sql.cbo.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565291#comment-17565291 ] Apache Spark commented on SPARK-39750: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37163 > Enable spark.sql.cbo.enabled by default > --- > > Key: SPARK-39750 > URL: https://issues.apache.org/jira/browse/SPARK-39750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39750) Enable spark.sql.cbo.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39750: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39750) Enable spark.sql.cbo.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565289#comment-17565289 ] Apache Spark commented on SPARK-39750: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37163 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39750) Enable spark.sql.cbo.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-39750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39750: Assignee: Apache Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39750) Enable spark.sql.cbo.enabled by default
Yuming Wang created SPARK-39750: --- Summary: Enable spark.sql.cbo.enabled by default Key: SPARK-39750 URL: https://issues.apache.org/jira/browse/SPARK-39750 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
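[Editor's note] For context, cost-based optimization is currently opt-in, and this ticket proposes flipping the default. A minimal sketch of how a user enables CBO today, using config keys documented for recent Spark releases (treat the exact set of keys as something to verify against your Spark version):

```properties
# spark-defaults.conf — enable the cost-based optimizer explicitly
spark.sql.cbo.enabled              true
# let CBO statistics drive join reordering as well
spark.sql.cbo.joinReorder.enabled  true
```

Note that CBO only helps when table statistics exist, e.g. after running `ANALYZE TABLE t COMPUTE STATISTICS FOR ALL COLUMNS`.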
[jira] [Resolved] (SPARK-39647) Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been
[ https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-39647. - Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37052 [https://github.com/apache/spark/pull/37052] > Block push fails with java.lang.IllegalArgumentException: Active local dirs > list has not been updated by any executor registration even when the > NodeManager hasn't been restarted > -- > > Key: SPARK-39647 > URL: https://issues.apache.org/jira/browse/SPARK-39647 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.3.1, 3.4.0 > > > We saw these exceptions during block push: > {code:java} > 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block > shuffle_170_568_174, and will not retry (0 retries) > org.apache.spark.network.shuffle.BlockPushException: > !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException: > Active local dirs list has not been updated by any executor registration > at > org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92) > at > org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300) > at > org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290) > at > org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312) > at > org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168) > 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 > to BlockManagerId(, node-x, 7337, None) failed. > {code} > Note: The NodeManager on node-x (node against which this exception was seen) > was not restarted. 
> This happened because the executor registers the block manager > with {{BlockManagerMaster}} before it registers with the ESS. In push-based > shuffle, a block manager is selected by the driver as a merger for the > shuffle push. However, the ESS on that node can successfully merge the blocks > only if it has received the metadata about merged directories from the local > executor (sent when the local executor registers with the ESS). If this local > executor registration is delayed but the ESS host has already been picked as a merger, > it will fail to merge the blocks pushed to it, which is what happened > here. > The local executor on node-x is executor 754, and the block manager > registration happened at 13:28:11: > {code:java} > 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has > registered (new total is 1200) > 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager > node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None) > {code} > The application was registered with the shuffle server on node-x at 13:29:40: > {code:java} > 2022-06-24 13:29:40,343 INFO > org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active > local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, > /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, > /grid/c/tmp/yarn/] for application application_1653753500486_3193550 > {code} > node-x was selected as a merger by the driver after 13:28:11, and when the > executors started pushing to it, all those pushes failed until 13:29:40. > We can fix this by having the executor register with the ESS before it registers the > block manager with the {{BlockManagerMaster}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
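[Editor's note] For readers unfamiliar with push-based shuffle: it must be enabled on both the client and the external shuffle service before the merger selection described above takes place. A rough sketch of the relevant settings (key names as documented for Spark 3.2+; verify the exact names against your version):

```properties
# client side: enable block push (supported with YARN as of Spark 3.2)
spark.shuffle.push.enabled     true
spark.shuffle.service.enabled  true
# shuffle-service side: use the merge-capable resolver from the stack trace above
spark.shuffle.push.server.mergedShuffleFileManagerImpl  org.apache.spark.network.shuffle.RemoteBlockPushResolver
```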
[jira] [Assigned] (SPARK-39647) Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been
[ https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-39647: --- Assignee: Chandni Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565243#comment-17565243 ] Apache Spark commented on SPARK-38910: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/37162 > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister() may throw an exception, so the staging dir should be cleaned up before calling unregister().
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565242#comment-17565242 ] Apache Spark commented on SPARK-38910: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/37162
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace
[ https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39723. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37135 [https://github.com/apache/spark/pull/37135] > Implement functionExists/getFunc in SparkR for 3L namespace > --- > > Key: SPARK-39723 > URL: https://issues.apache.org/jira/browse/SPARK-39723 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace
[ https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39723: - Assignee: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39749) Always use plain string representation on casting Decimal to String
[ https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39749: --- Summary: Always use plain string representation on casting Decimal to String (was: Use plain string representation on casting Decimal to String) > Always use plain string representation on casting Decimal to String > --- > > Key: SPARK-39749 > URL: https://issues.apache.org/jira/browse/SPARK-39749 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, casting a decimal to string type will result in strings with > exponential notation if the adjusted exponent is less than -6. This is > consistent with BigDecimal.toString > [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString] > > This is different from external databases like PostgreSQL/Oracle/MS SQL > Server. It isn't compliant with the ANSI SQL standard either. > I suggest always using the plain string representation in the cast. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39748: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39748: Assignee: Apache Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565236#comment-17565236 ] Apache Spark commented on SPARK-39748: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37161 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39749) Use plain string representation on casting Decimal to String
[ https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565235#comment-17565235 ] Apache Spark commented on SPARK-39749: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37160 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39749) Use plain string representation on casting Decimal to String
[ https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39749: Assignee: Apache Spark (was: Gengliang Wang) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39749) Use plain string representation on casting Decimal to String
[ https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565234#comment-17565234 ] Apache Spark commented on SPARK-39749: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37160 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39749) Use plain string representation on casting Decimal to String
[ https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39749: Assignee: Gengliang Wang (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39711) Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
[ https://issues.apache.org/jira/browse/SPARK-39711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-39711. Fix Version/s: 3.4.0 Assignee: BingKun Pan Resolution: Fixed > Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging > > > Key: SPARK-39711 > URL: https://issues.apache.org/jira/browse/SPARK-39711 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > SparkFunSuite is declared as follows: > {code:java} > abstract class SparkFunSuite > extends AnyFunSuite > with BeforeAndAfterAll > with BeforeAndAfterEach > with ThreadAudit > with Logging > {code} > Some suites extend SparkFunSuite while also mixing in BeforeAndAfterAll, BeforeAndAfterEach, or Logging, which is redundant. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39749) Use plain string representation on casting Decimal to String
Gengliang Wang created SPARK-39749: -- Summary: Use plain string representation on casting Decimal to String Key: SPARK-39749 URL: https://issues.apache.org/jira/browse/SPARK-39749 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Currently, casting a decimal to string type will result in strings with exponential notation if the adjusted exponent is less than -6. This is consistent with BigDecimal.toString [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString] This is different from external databases like PostgreSQL/Oracle/MS SQL Server. It isn't compliant with the ANSI SQL standard either. I suggest always using the plain string representation in the cast. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
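[Editor's note] The behavior described in this ticket is easy to reproduce with plain java.math.BigDecimal (the class referenced in the description): toString switches to scientific notation once the adjusted exponent drops below -6, while toPlainString never does. A minimal sketch:

```java
import java.math.BigDecimal;

public class DecimalToString {
    public static void main(String[] args) {
        // adjusted exponent is -7, i.e. less than -6, so toString goes scientific
        BigDecimal small = new BigDecimal("0.0000001");
        System.out.println(small.toString());      // prints 1E-7
        System.out.println(small.toPlainString()); // prints 0.0000001

        // adjusted exponent is exactly -6, so toString stays plain
        BigDecimal boundary = new BigDecimal("0.000001");
        System.out.println(boundary.toString());   // prints 0.000001
    }
}
```

The proposal is effectively to produce toPlainString-style output when casting Decimal to String.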
[jira] [Resolved] (SPARK-39737) PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter
[ https://issues.apache.org/jira/browse/SPARK-39737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39737. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37150 [https://github.com/apache/spark/pull/37150] > PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter > > > Key: SPARK-39737 > URL: https://issues.apache.org/jira/browse/SPARK-39737 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > Currently, Spark supports the ANSI aggregate functions percentile_cont and > percentile_disc, but the two aggregate functions do not support an aggregate FILTER clause. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39737) PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter
[ https://issues.apache.org/jira/browse/SPARK-39737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39737: --- Assignee: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame
Jungtaek Lim created SPARK-39748: Summary: Include the origin logical plan for LogicalRDD if it comes from DataFrame Key: SPARK-39748 URL: https://issues.apache.org/jira/browse/SPARK-39748 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim When Spark converts a DataFrame to LogicalRDD for some reason (e.g. the foreachBatch sink), Spark just picks the RDD from the origin DataFrame and discards the (logical/physical) plan. The origin logical plan can be useful for several use cases, including: 1. connecting the overall logical plan into one 2. inheriting plan stats from the origin logical plan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39741) Support url encode/decode as built-in function
[ https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39741: Fix Version/s: (was: 3.4.0) > Support url encode/decode as built-in function > -- > > Key: SPARK-39741 > URL: https://issues.apache.org/jira/browse/SPARK-39741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Minor > > Currently, Spark does not support url encode/decode as built-in functions; the > user might use reflect instead, which is a bit of a hassle, and these > functions are often useful. > This PR aims to add url encode/decode as built-in functions. >
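A minimal sketch of the form-style URL encoding the proposed built-ins would expose, using Python's standard library (the sample string is illustrative; the issue's proposed SQL functions are separate from these helpers):

```python
from urllib.parse import quote_plus, unquote_plus

# Form-style URL encoding: spaces become '+', reserved characters like
# '=', '&', and '/' become percent-escapes, and decoding round-trips.
original = "param=a b&lang=scala/python"
encoded = quote_plus(original)
decoded = unquote_plus(encoded)
```

This mirrors what users currently reach for via `reflect('java.net.URLEncoder', 'encode', ...)` in Spark SQL, which is the workaround the issue wants to replace.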
[jira] [Updated] (SPARK-39723) Implement functionExists/getFunc in SparkR for 3L namespace
[ https://issues.apache.org/jira/browse/SPARK-39723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-39723: -- Summary: Implement functionExists/getFunc in SparkR for 3L namespace (was: Implement functionExists/getFunction in SparkR for 3L namespace) > Implement functionExists/getFunc in SparkR for 3L namespace > --- > > Key: SPARK-39723 > URL: https://issues.apache.org/jira/browse/SPARK-39723 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job
[ https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39736: Assignee: Yikun Jiang > Enable base image build in SparkR job > - > > Key: SPARK-39736 > URL: https://issues.apache.org/jira/browse/SPARK-39736 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major >
[jira] [Resolved] (SPARK-39736) Enable base image build in SparkR job
[ https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39736. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37158 [https://github.com/apache/spark/pull/37158] > Enable base image build in SparkR job > - > > Key: SPARK-39736 > URL: https://issues.apache.org/jira/browse/SPARK-39736 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Created] (SPARK-39747) pandas and pandas on Spark API parameter naming difference
Chenyang Zhang created SPARK-39747: -- Summary: pandas and pandas on Spark API parameter naming difference Key: SPARK-39747 URL: https://issues.apache.org/jira/browse/SPARK-39747 Project: Spark Issue Type: Improvement Components: Pandas API on Spark Affects Versions: 3.3.0 Reporter: Chenyang Zhang I noticed there are some parameter naming differences between pandas and pandas on Spark. For example, in "read_csv", the path parameter is "filepath_or_buffer" for pandas and "path" for pandas on Spark. I wonder why such a difference exists, and whether the parameter names could be changed to match pandas exactly.
[jira] [Updated] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration
[ https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Mcmullan updated SPARK-39696: - Fix Version/s: (was: 3.3.1) Target Version/s: (was: 3.3.0) > Uncaught exception in thread executor-heartbeater > java.util.ConcurrentModificationException: mutation occurred during iteration > --- > > Key: SPARK-39696 > URL: https://issues.apache.org/jira/browse/SPARK-39696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 > Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 > distribution) > Scala 2.13.8 / OpenJDK 17.0.3 application compilation > Alpine Linux 3.14.3 > JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12 >Reporter: Stephen Mcmullan >Priority: Major > > {noformat} > 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] > org.apache.spark.util.Utils - Uncaught exception in thread > executor-heartbeater > java.util.ConcurrentModificationException: mutation occurred during iteration > at > scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) > ~[scala-library-2.13.8.jar:?] 
> at > scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) > ~[scala-library-2.13.8.jar:?] > at scala.collection.SeqOps.concat(Seq.scala:187) > ~[scala-library-2.13.8.jar:?] > at scala.collection.SeqOps.concat$(Seq.scala:187) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractSeq.concat(Seq.scala:1161) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) > ~[scala-library-2.13.8.jar:?] > at > org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterable.foreach(Iterable.scala:926) > ~[scala-library-2.13.8.jar:?] > at > org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) > ~[scala-library-2.13.8.jar:?] 
> at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?] > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) > ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > ~[?:?] > at > ja
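The stack trace above shows Scala's collection mutation checking firing when the heartbeat thread iterates the task-metrics accumulators while the task thread mutates them. The same class of failure can be reproduced single-threaded in Python, where CPython likewise detects a dict mutated mid-iteration (the metric names are illustrative):

```python
# Reproducing the mutation-during-iteration failure mode in miniature.
metrics = {"bytesRead": 10, "recordsRead": 2}

error = None
try:
    for name in metrics:               # iteration in progress...
        metrics["shuffleWrite"] = 1    # ...while the collection is mutated
except RuntimeError as exc:            # "dictionary changed size during iteration"
    error = exc
```

The usual fix, as in the Spark case, is to snapshot the collection (or guard it with a lock) before handing it to a concurrent reader.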
[jira] [Created] (SPARK-39746) Binary array operations can be faster if one side is a constant
David Vogelbacher created SPARK-39746: - Summary: Binary array operations can be faster if one side is a constant Key: SPARK-39746 URL: https://issues.apache.org/jira/browse/SPARK-39746 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: David Vogelbacher Array operations such as [ArraysOverlap|https://github.com/apache/spark/blob/79f133b7bbc1d9aa6a20dd8a34ec120902f96155/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1367] are optimized to put all the elements of the smaller array into a HashSet, if elements properly support equals. However, if one of the arrays is a constant, we could do much better: we don't have to reconstruct the HashSet for each row, we could construct it just once and send it to all the executors. This would improve runtime by a constant factor.
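The proposed optimization — hoisting the hash-set construction out of the per-row loop when one side is a literal — can be sketched as follows (plain Python standing in for the generated evaluation code; all names are illustrative):

```python
# arrays_overlap(column, constant_array): instead of rebuilding the
# constant side's hash set for every row, build it once up front.
constant_array = ["b", "c"]
constant_set = set(constant_array)   # built once, reused for every row

rows = [["x", "b"], ["y"], ["z", "c"]]
overlaps = [any(elem in constant_set for elem in row) for row in rows]
```

Per row the work drops from O(|const| + |row|) to O(|row|) set probes, which is the constant-factor win the issue describes.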
[jira] [Created] (SPARK-39745) Accept a list that contains NumPy scalars in `createDataFrame`
Xinrong Meng created SPARK-39745: Summary: Accept a list that contains NumPy scalars in `createDataFrame` Key: SPARK-39745 URL: https://issues.apache.org/jira/browse/SPARK-39745 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Currently, only lists of native Python scalars are accepted in `createDataFrame`. We should support NumPy scalars as well.
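NumPy scalar types expose `.item()`, which returns the equivalent native Python scalar, so the conversion `createDataFrame` would need is mechanical. A dependency-free sketch (the `Int64Scalar` class below is a hypothetical stand-in for `np.int64`, used only to keep the example self-contained):

```python
class Int64Scalar:
    """Hypothetical stand-in for np.int64; like it, exposes .item()."""
    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)  # native Python int, as np.int64.item() returns

def to_python_scalar(x):
    # NumPy scalars (and the stand-in) expose .item(); native scalars pass through.
    return x.item() if hasattr(x, "item") else x

converted = [to_python_scalar(v) for v in [Int64Scalar(1), 2, Int64Scalar(3)]]
```

After this normalization the existing schema-inference path for native Python scalars applies unchanged.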
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: Currently, DataFrame creation from a list of native Python scalars is unsupported in PySpark, for example, {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). was: {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). 
> Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example, > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). was: {{Currently, DataFrame creation from a list of native Python scalars is unsupported in PySpark, for example,}} {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). > Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. 
> See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: {{Currently, DataFrame creation from a list of native Python scalars is unsupported in PySpark, for example,}} {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). was: {{Currently, DataFrame creation from a list of scalars is unsupported in PySpark, for example,}} {{>>> spark.createDataFrame([1, 2]).collect()}} {{Traceback (most recent call last):}} {{...}} {{TypeError: Can not infer schema for type: }} {{However, Spark DataFrame Scala API supports that:}} {{scala> Seq(1, 2).toDF().collect()}} {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). 
> Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > {{Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example,}} > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).
[jira] [Commented] (SPARK-38796) Implement the to_number and try_to_number SQL functions according to a new specification
[ https://issues.apache.org/jira/browse/SPARK-38796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565088#comment-17565088 ] Apache Spark commented on SPARK-38796: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37159 > Implement the to_number and try_to_number SQL functions according to a new > specification > > > Key: SPARK-38796 > URL: https://issues.apache.org/jira/browse/SPARK-38796 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.3.0 > > > This tracks implementing the 'to_number' and 'try_to_number' SQL function > expressions according to new semantics described below. The former is > equivalent to the latter except that it throws an exception instead of > returning NULL for cases where the input string does not match the format > string. > > --- > > *try_to_number function (expr, fmt):* > Returns 'expr' cast to DECIMAL using formatting 'fmt', or 'NULL' if 'expr' is > not a valid match for the given format. > > Syntax: > [ S ] [ L | $ ] > [ 0 | 9 | G | , ] [...] > [ . | D ] > [ 0 | 9 ] [...] > [ L | $ ] [ PR | MI | S ] ' } > > *Arguments:* > 'expr': A STRING expression representing a number. 'expr' may include leading > or trailing spaces. > 'fmt': A STRING literal, specifying the expected format of 'expr'. > > *Returns:* > A DECIMAL(p, s) where 'p' is the total number of digits ('0' or '9') and 's' > is the number of digits after the decimal point, or 0 if there is none. > > *Format elements allowed (case insensitive):* > * 0 or 9 > Specifies an expected digit between '0' and '9'. > A '0' to the left of the decimal point indicates that 'expr' must have at > least as many digits. A leading '9' indicates that 'expr' may omit these > digits. > 'expr' must not be larger than the number of digits to the left of the > decimal point allowed by the format string. 
> Digits to the right of the decimal point in the format string indicate the > most digits that 'expr' may have to the right of the decimal point. > * . or D > Specifies the position of the decimal point. > 'expr' does not need to include a decimal point. > * , or G > Specifies the position of the ',' grouping (thousands) separator. > There must be a '0' or '9' to the left of the rightmost grouping separator. > 'expr' must match the grouping separator relevant for the size of the > number. > * $ > Specifies the location of the '$' currency sign. This character may only be > specified once. > * S > Specifies the position of an optional '+' or '-' sign. This character may > only be specified once. > * MI > Specifies that 'expr' has an optional '-' sign at the end, but no '+'. > * PR > Specifies that 'expr' indicates a negative number with wrapping angled > brackets ('<1>'). If 'expr' contains any characters other than '0' through > '9' and those permitted in 'fmt', a 'NULL' is returned. > > *Examples:* > {{– The format expects:}} > {{– * an optional sign at the beginning,}} > {{– * followed by a dollar sign,}} > {{– * followed by a number between 3 and 6 digits long,}} > {{– * thousands separators,}} > {{– * up to two digits beyond the decimal point. 
}} > {{> SELECT try_to_number('-$12,345.67', 'S$999,099.99');}} {{ -12345.67}} > {{– The plus sign is optional, and so are fractional digits.}} > {{> SELECT try_to_number('$345', 'S$999,099.99');}} {{ 345.00}} > {{– The format requires at least three digits.}} > {{> SELECT try_to_number('$45', 'S$999,099.99');}} {{ NULL}} > {{– The format requires at least three digits.}} > {{> SELECT try_to_number('$045', 'S$999,099.99');}} {{ 45.00}} > {{– Using brackets to denote negative values}} > {{> SELECT try_to_number('<1234>', '99PR');}} {{ -1234}}
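The rules above can be made concrete with a much-simplified Python sketch. This is not Spark's implementation: it supports only S, $, 0/9 digits, ',' grouping, '.', and PR, and handles grouping separators leniently by stripping them; the function name mirrors the SQL function purely for readability.

```python
from decimal import Decimal

def try_to_number(expr, fmt):
    """Simplified try_to_number sketch: returns a Decimal, or None for NULL."""
    expr, fmt, neg = expr.strip(), fmt.upper(), False
    if fmt.endswith("PR"):                       # <n> denotes a negative number
        fmt = fmt[:-2]
        if expr.startswith("<") and expr.endswith(">"):
            neg, expr = True, expr[1:-1]
    if fmt.startswith("S"):                      # optional leading +/- sign
        fmt = fmt[1:]
        if expr[:1] in ("+", "-"):
            neg, expr = expr[0] == "-", expr[1:]
    if fmt.startswith("$"):                      # currency sign must be present
        fmt = fmt[1:]
        if not expr.startswith("$"):
            return None
        expr = expr[1:]
    int_fmt, _, frac_fmt = fmt.partition(".")
    int_digits = int_fmt.replace(",", "")        # lenient: ignore grouping
    expr_int, _, expr_frac = expr.replace(",", "").partition(".")
    if not expr_int.isdigit() or (expr_frac and not expr_frac.isdigit()):
        return None
    # a '0' marks the leftmost *required* digit; '9' digits may be omitted
    min_int = len(int_digits) - int_digits.index("0") if "0" in int_digits else 0
    if not (min_int <= len(expr_int) <= len(int_digits)):
        return None
    if len(expr_frac) > len(frac_fmt):
        return None
    scale = Decimal(1).scaleb(-len(frac_fmt)) if frac_fmt else Decimal(1)
    value = Decimal(expr_int + "." + (expr_frac or "0")).quantize(scale)
    return -value if neg else value
```

With the format 'S$999,099.99' this reproduces the worked examples: '-$12,345.67' parses to -12345.67, '$345' to 345.00, and '$45' returns NULL because the '0' in the format requires three integer digits. For the PR case the sketch needs a digit mask wide enough for the input, e.g. try_to_number('<1234>', '9999PR') (a widened mask, assumed here for illustration).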
[jira] [Updated] (SPARK-38999) Refactor DataSourceScanExec code to
[ https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38999: -- Target Version/s: (was: 3.3.0, 3.2.2, 3.4.0) > Refactor DataSourceScanExec code to > > > Key: SPARK-38999 > URL: https://issues.apache.org/jira/browse/SPARK-38999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Fix For: 3.4.0 > > > Currently the code for the `FileSourceScanExec` class, the physical node for > file scans, is quite complex and lengthy. The class should be refactored into > a trait `FileSourceScanLike` which implements basic functionality like > metrics and file listing. The execution-specific code can then live inside > `FileSourceScanExec`, which will subclass `FileSourceScanLike`.
[jira] [Commented] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server
[ https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565057#comment-17565057 ] Dongjoon Hyun commented on SPARK-38862: --- I removed the invalid versions from `Affected Versions` and `Target Versions` field. > Basic Authentication or Token Based Authentication for The REST Submission > Server > - > > Key: SPARK-38862 > URL: https://issues.apache.org/jira/browse/SPARK-38862 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 3.4.0 >Reporter: Jack >Priority: Major > Labels: authentication, rest, spark, spark-submit, submit > > [Spark documentation|https://spark.apache.org/docs/latest/security.html] > states that > ??The REST Submission Server and the MesosClusterDispatcher do not support > authentication. You should ensure that all network access to the REST API & > MesosClusterDispatcher (port 6066 and 7077 respectively by default) are > restricted to hosts that are trusted to submit jobs.?? > Whilst it is true that we can use network policies to restrict access to our > exposed submission endpoint, it would be preferable to at least also allow > some primitive form of authentication at a global level, whether this is by > some token provided to the runtime environment or is a "system user" using > basic authentication of a username/password combination - I am not strictly > opinionated and I think either would suffice. > I appreciate that one could implement a custom proxy to provide this > authentication check, but it seems like a common use case that others may > benefit from to be able to authenticate against the rest submission endpoint, > and by implementing this capability as an optionally configurable aspect of > Spark itself, we can utilise the existing server to provide this check. 
> I would imagine that whatever solution is agreed for a first phase, a custom > authenticator may be something we want a user to be able to provide so that > if an admin needed some more advanced authentication check, such as RBAC et > al, it could be facilitated without the need for writing a complete custom > proxy layer; but I do feel there should be some basic built in available; eg. > RestSubmissionBasicAuthenticator.
[jira] [Updated] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server
[ https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38862: -- Target Version/s: (was: 3.3.0, 3.2.2) > Basic Authentication or Token Based Authentication for The REST Submission > Server > - > > Key: SPARK-38862 > URL: https://issues.apache.org/jira/browse/SPARK-38862 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Jack >Priority: Major > Labels: authentication, rest, spark, spark-submit, submit > > [Spark documentation|https://spark.apache.org/docs/latest/security.html] > states that > ??The REST Submission Server and the MesosClusterDispatcher do not support > authentication. You should ensure that all network access to the REST API & > MesosClusterDispatcher (port 6066 and 7077 respectively by default) are > restricted to hosts that are trusted to submit jobs.?? > Whilst it is true that we can use network policies to restrict access to our > exposed submission endpoint, it would be preferable to at least also allow > some primitive form of authentication at a global level, whether this is by > some token provided to the runtime environment or is a "system user" using > basic authentication of a username/password combination - I am not strictly > opinionated and I think either would suffice. > I appreciate that one could implement a custom proxy to provide this > authentication check, but it seems like a common use case that others may > benefit from to be able to authenticate against the rest submission endpoint, > and by implementing this capability as an optionally configurable aspect of > Spark itself, we can utilise the existing server to provide this check. 
> I would imagine that whatever solution is agreed for a first phase, a custom > authenticator may be something we want a user to be able to provide so that > if an admin needed some more advanced authentication check, such as RBAC et > al, it could be facilitated without the need for writing a complete custom > proxy layer; but I do feel there should be some basic built in available; eg. > RestSubmissionBasicAuthenticator.
[jira] [Updated] (SPARK-38862) Basic Authentication or Token Based Authentication for The REST Submission Server
[ https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38862: -- Affects Version/s: 3.4.0 (was: 3.0.0) (was: 3.1.0) (was: 3.0.1) (was: 3.0.2) (was: 3.2.0) (was: 3.1.1) (was: 3.1.2) (was: 3.0.3) (was: 3.2.1) > Basic Authentication or Token Based Authentication for The REST Submission > Server > - > > Key: SPARK-38862 > URL: https://issues.apache.org/jira/browse/SPARK-38862 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 3.4.0 >Reporter: Jack >Priority: Major > Labels: authentication, rest, spark, spark-submit, submit > > [Spark documentation|https://spark.apache.org/docs/latest/security.html] > states that > ??The REST Submission Server and the MesosClusterDispatcher do not support > authentication. You should ensure that all network access to the REST API & > MesosClusterDispatcher (port 6066 and 7077 respectively by default) are > restricted to hosts that are trusted to submit jobs.?? > Whilst it is true that we can use network policies to restrict access to our > exposed submission endpoint, it would be preferable to at least also allow > some primitive form of authentication at a global level, whether this is by > some token provided to the runtime environment or is a "system user" using > basic authentication of a username/password combination - I am not strictly > opinionated and I think either would suffice. > I appreciate that one could implement a custom proxy to provide this > authentication check, but it seems like a common use case that others may > benefit from to be able to authenticate against the rest submission endpoint, > and by implementing this capability as an optionally configurable aspect of > Spark itself, we can utilise the existing server to provide this check. 
> I would imagine that whatever solution is agreed for a first phase, a custom > authenticator may be something we want a user to be able to provide so that > if an admin needed some more advanced authentication check, such as RBAC et > al, it could be facilitated without the need for writing a complete custom > proxy layer; but I do feel there should be some basic built in available; eg. > RestSubmissionBasicAuthenticator.
[jira] [Commented] (SPARK-39736) Enable base image build in SparkR job
[ https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565052#comment-17565052 ] Apache Spark commented on SPARK-39736: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37158 > Enable base image build in SparkR job > - > > Key: SPARK-39736 > URL: https://issues.apache.org/jira/browse/SPARK-39736 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job
[ https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39736: Assignee: Apache Spark > Enable base image build in SparkR job > - > > Key: SPARK-39736 > URL: https://issues.apache.org/jira/browse/SPARK-39736 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39736) Enable base image build in SparkR job
[ https://issues.apache.org/jira/browse/SPARK-39736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39736: Assignee: (was: Apache Spark) > Enable base image build in SparkR job > - > > Key: SPARK-39736 > URL: https://issues.apache.org/jira/browse/SPARK-39736 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39667) Add another workaround when there is not enough memory to build and broadcast the table
[ https://issues.apache.org/jira/browse/SPARK-39667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-39667. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37069 [https://github.com/apache/spark/pull/37069] > Add another workaround when there is not enough memory to build and broadcast > the table > --- > > Key: SPARK-39667 > URL: https://issues.apache.org/jira/browse/SPARK-39667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39667) Add another workaround when there is not enough memory to build and broadcast the table
[ https://issues.apache.org/jira/browse/SPARK-39667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-39667: --- Assignee: Yuming Wang > Add another workaround when there is not enough memory to build and broadcast > the table > --- > > Key: SPARK-39667 > URL: https://issues.apache.org/jira/browse/SPARK-39667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39744) Add the REGEXP_INSTR function
[ https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564934#comment-17564934 ] Apache Spark commented on SPARK-39744: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37154 > Add the REGEXP_INSTR function > - > > Key: SPARK-39744 > URL: https://issues.apache.org/jira/browse/SPARK-39744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The function should return the position of the specified occurrence of the > regular expression pattern in the input string. If no match is found, returns > 0. See other DBMSs: > - MariaDB: [https://mariadb.com/kb/en/regexp_instr/] > - Oracle: > [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm] > - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr] > - Snowflake: > [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html] > - BigQuery: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr] > - Redshift: > [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html] > - Exasol DB: > [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm] > - Vertica: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39744) Add the REGEXP_INSTR function
[ https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39744: Assignee: Max Gekk (was: Apache Spark) > Add the REGEXP_INSTR function > - > > Key: SPARK-39744 > URL: https://issues.apache.org/jira/browse/SPARK-39744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The function should return the position of the specified occurrence of the > regular expression pattern in the input string. If no match is found, returns > 0. See other DBMSs: > - MariaDB: [https://mariadb.com/kb/en/regexp_instr/] > - Oracle: > [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm] > - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr] > - Snowflake: > [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html] > - BigQuery: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr] > - Redshift: > [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html] > - Exasol DB: > [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm] > - Vertica: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39744) Add the REGEXP_INSTR function
[ https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39744: Assignee: Apache Spark (was: Max Gekk) > Add the REGEXP_INSTR function > - > > Key: SPARK-39744 > URL: https://issues.apache.org/jira/browse/SPARK-39744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The function should return the position of the specified occurrence of the > regular expression pattern in the input string. If no match is found, returns > 0. See other DBMSs: > - MariaDB: [https://mariadb.com/kb/en/regexp_instr/] > - Oracle: > [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm] > - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr] > - Snowflake: > [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html] > - BigQuery: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr] > - Redshift: > [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html] > - Exasol DB: > [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm] > - Vertica: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39744) Add the REGEXP_INSTR function
[ https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39744: - Description: The function should return the position of the specified occurrence of the regular expression pattern in the input string. If no match is found, returns 0. See other DBMSs: - MariaDB: [https://mariadb.com/kb/en/regexp_instr/] - Oracle: [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm] - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr] - Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html] - BigQuery: [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr] - Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html] - Exasol DB: [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm] - Vertica: [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm] was: The function should search a string for a regular expression pattern and returns it or NULL of it is not found. 
See other DBMSs: - Oracle: https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm - DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html - BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html - MariaDB: https://mariadb.com/kb/en/regexp_substr/ - Exasol DB: https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm > Add the REGEXP_INSTR function > - > > Key: SPARK-39744 > URL: https://issues.apache.org/jira/browse/SPARK-39744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The function should return the position of the specified occurrence of the > regular expression pattern in the input string. If no match is found, returns > 0. See other DBMSs: > - MariaDB: [https://mariadb.com/kb/en/regexp_instr/] > - Oracle: > [https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm] > - DB2: [https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr] > - Snowflake: > [https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html] > - BigQuery: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr] > - Redshift: > [https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html] > - Exasol DB: > [https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm] > - Vertica: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
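The semantics described above (1-based position of the n-th match, 0 if there is none) can be sketched in Python. This is an illustration of the proposed behavior using the standard `re` module, not Spark's implementation; real REGEXP_INSTR variants in the listed DBMSs also take a start position, a return option, and match flags:

```python
import re


def regexp_instr(s, pattern, occurrence=1):
    """Return the 1-based position of the `occurrence`-th match of
    `pattern` in `s`, or 0 if no such match exists.

    Illustrative sketch of the REGEXP_INSTR semantics proposed in
    SPARK-39744; parameter handling is simplified.
    """
    for i, m in enumerate(re.finditer(pattern, s), start=1):
        if i == occurrence:
            return m.start() + 1  # SQL string positions are 1-based
    return 0
```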
[jira] [Commented] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected
[ https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564930#comment-17564930 ] Apache Spark commented on SPARK-39742: -- User 'zml1206' has created a pull request for this issue: https://github.com/apache/spark/pull/37156 > Request executor after kill executor, the number of executors is not as > expected > > > Key: SPARK-39742 > URL: https://issues.apache.org/jira/browse/SPARK-39742 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.2.1 >Reporter: zhuml >Priority: Major > > I used the killExecutors and requestExecutors function of SparkContext to > dynamically adjust the resources, and found that the requestExecutors after > killExecutors could not achieve the expected results. > Add unit tests in StandaloneDynamicAllocationSuite.scala > {code:java} > test("kill executors first and then request") { > sc = new SparkContext(appConf > .set(config.EXECUTOR_CORES, 2) > .set(config.CORES_MAX, 8)) > val appId = sc.applicationId > eventually(timeout(10.seconds), interval(10.millis)) { > val apps = getApplications() > assert(apps.size === 1) > assert(apps.head.id === appId) > assert(apps.head.executors.size === 4) // 8 cores total > assert(apps.head.getExecutorLimit === Int.MaxValue) > } > // sync executors between the Master and the driver, needed because > // the driver refuses to kill executors it does not know about > syncExecutors(sc) > val executors = getExecutorIds(sc) > assert(executors.size === 4) > // kill 3 executors > assert(sc.killExecutors(executors.take(3))) > val apps = getApplications() > assert(apps.head.executors.size === 1) > // request 3 executors > assert(sc.requestExecutors(3)) > assert(apps.head.executors.size === 4) > } {code} > 3 did not equal 4 > Expected :4 > Actual :3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564931#comment-17564931 ] Gabor Roczei commented on SPARK-35242: -- Ok, thanks [~hongdongdong]! > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > The Spark catalog default database can only be 'default'. When we cannot access > 'default', we get a 'Permission denied' exception. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected
[ https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39742: Assignee: (was: Apache Spark) > Request executor after kill executor, the number of executors is not as > expected > > > Key: SPARK-39742 > URL: https://issues.apache.org/jira/browse/SPARK-39742 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.2.1 >Reporter: zhuml >Priority: Major > > I used the killExecutors and requestExecutors function of SparkContext to > dynamically adjust the resources, and found that the requestExecutors after > killExecutors could not achieve the expected results. > Add unit tests in StandaloneDynamicAllocationSuite.scala > {code:java} > test("kill executors first and then request") { > sc = new SparkContext(appConf > .set(config.EXECUTOR_CORES, 2) > .set(config.CORES_MAX, 8)) > val appId = sc.applicationId > eventually(timeout(10.seconds), interval(10.millis)) { > val apps = getApplications() > assert(apps.size === 1) > assert(apps.head.id === appId) > assert(apps.head.executors.size === 4) // 8 cores total > assert(apps.head.getExecutorLimit === Int.MaxValue) > } > // sync executors between the Master and the driver, needed because > // the driver refuses to kill executors it does not know about > syncExecutors(sc) > val executors = getExecutorIds(sc) > assert(executors.size === 4) > // kill 2 executors > assert(sc.killExecutors(executors.take(3))) > val apps = getApplications() > assert(apps.head.executors.size === 1) > // add 2 executors > assert(sc.requestExecutors(3)) > assert(apps.head.executors.size === 4) > } {code} > 3 did not equal 4 > Expected :4 > Actual :3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected
[ https://issues.apache.org/jira/browse/SPARK-39742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39742: Assignee: Apache Spark > Request executor after kill executor, the number of executors is not as > expected > > > Key: SPARK-39742 > URL: https://issues.apache.org/jira/browse/SPARK-39742 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.2.1 >Reporter: zhuml >Assignee: Apache Spark >Priority: Major > > I used the killExecutors and requestExecutors function of SparkContext to > dynamically adjust the resources, and found that the requestExecutors after > killExecutors could not achieve the expected results. > Add unit tests in StandaloneDynamicAllocationSuite.scala > {code:java} > test("kill executors first and then request") { > sc = new SparkContext(appConf > .set(config.EXECUTOR_CORES, 2) > .set(config.CORES_MAX, 8)) > val appId = sc.applicationId > eventually(timeout(10.seconds), interval(10.millis)) { > val apps = getApplications() > assert(apps.size === 1) > assert(apps.head.id === appId) > assert(apps.head.executors.size === 4) // 8 cores total > assert(apps.head.getExecutorLimit === Int.MaxValue) > } > // sync executors between the Master and the driver, needed because > // the driver refuses to kill executors it does not know about > syncExecutors(sc) > val executors = getExecutorIds(sc) > assert(executors.size === 4) > // kill 2 executors > assert(sc.killExecutors(executors.take(3))) > val apps = getApplications() > assert(apps.head.executors.size === 1) > // add 2 executors > assert(sc.requestExecutors(3)) > assert(apps.head.executors.size === 4) > } {code} > 3 did not equal 4 > Expected :4 > Actual :3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39744) Add the REGEXP_INSTR function
Max Gekk created SPARK-39744: Summary: Add the REGEXP_INSTR function Key: SPARK-39744 URL: https://issues.apache.org/jira/browse/SPARK-39744 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.4.0 The function should search a string for a regular expression pattern and return it, or NULL if it is not found. See other DBMSs: - Oracle: https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm - DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html - BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html - MariaDB: https://mariadb.com/kb/en/regexp_substr/ - Exasol DB: https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39744) Add the REGEXP_INSTR function
[ https://issues.apache.org/jira/browse/SPARK-39744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39744: - Fix Version/s: (was: 3.4.0) > Add the REGEXP_INSTR function > - > > Key: SPARK-39744 > URL: https://issues.apache.org/jira/browse/SPARK-39744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The function should search a string for a regular expression pattern and > return it, or NULL if it is not found. See other DBMSs: > - Oracle: > https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm > - DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr > - Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html > - BigQuery: > https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr > - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html > - MariaDB: https://mariadb.com/kb/en/regexp_substr/ > - Exasol DB: > https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39735) Enable base image build in lint job and fix sparkr env
[ https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564916#comment-17564916 ] Apache Spark commented on SPARK-39735: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37155 > Enable base image build in lint job and fix sparkr env > -- > > Key: SPARK-39735 > URL: https://issues.apache.org/jira/browse/SPARK-39735 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > Since R 4.2.x has [the below new > change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376): > > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the > R system default if unset or empty, and can be set to NULL to indicate an > empty list of user or site library directories. > The latest Ubuntu package also syncs this change and modifies > /etc/R/Renviron: > ``` > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron | grep R_LIBS_SITE > R_LIBS_SITE=${R_LIBS_SITE:-'%S'} > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron.site | grep R_LIBS_SITE > ## edd Jul 2007 Now use R_LIBS_SITE, not R_LIBS > ## edd Mar 2022 Now in Renviron.site reflecting R_LIBS_SITE > R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library" > ``` > So, we add `R_LIBS_SITE` to the ENV from `/etc/R/Renviron.site` to make sure the > search paths are right for SparkR.
> Otherwise, even if we install `lintr`, errors like the following occur because > `R_LIBS_SITE` is set incorrectly: > ``` > $ dev/lint-r > Loading required namespace: SparkR > Loading required namespace: lintr > Failed with error: 'there is no package called 'lintr'' > Installing package into '/usr/lib/R/site-library' > (as 'lib' is unspecified) > Error in contrib.url(repos, type) : > trying to use CRAN without setting a mirror > Calls: install.packages -> startsWith -> contrib.url > Execution halted > ``` > [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html > [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39739) Upgrade sbt to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-39739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39739. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37151 [https://github.com/apache/spark/pull/37151] > Upgrade sbt to 1.7.0 > > > Key: SPARK-39739 > URL: https://issues.apache.org/jira/browse/SPARK-39739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > https://eed3si9n.com/sbt-1.7.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39739) Upgrade sbt to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-39739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39739: Assignee: Yang Jie > Upgrade sbt to 1.7.0 > > > Key: SPARK-39739 > URL: https://issues.apache.org/jira/browse/SPARK-39739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://eed3si9n.com/sbt-1.7.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeachan Park updated SPARK-39743: - Summary: Unable to set zstd compression level while writing parquet files (was: Unable to set zstd compression level while writing parquet) > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on the zstd > compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd CLI tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39743) Unable to set zstd compression level while writing parquet
Yeachan Park created SPARK-39743: Summary: Unable to set zstd compression level while writing parquet Key: SPARK-39743 URL: https://issues.apache.org/jira/browse/SPARK-39743 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Yeachan Park While writing zstd-compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the zstd compression level. All files seem to be written with the default zstd compression level, and the config option seems to be ignored. Using the zstd CLI tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
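The general point (a higher compression level should yield smaller output for the same input) can be demonstrated with zlib from the Python standard library, since the stdlib ships no zstd binding; this is an analogy to the reporter's zstd CLI experiment, not a reproduction of the Spark/parquet code path:

```python
import zlib

# Highly compressible sample data, standing in for a columnar data file.
data = b"spark " * 10_000

fast = zlib.compress(data, level=1)  # low effort, faster
best = zlib.compress(data, level=9)  # high effort, smaller output

# If the level setting were silently ignored (as reported for
# spark.io.compression.zstd.level), both outputs would be identical.
```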
[jira] [Commented] (SPARK-39619) PrometheusServlet: add "TYPE" comment to exposed metrics
[ https://issues.apache.org/jira/browse/SPARK-39619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564892#comment-17564892 ] Eric Barault commented on SPARK-39619: -- A PR was provided here: https://github.com/apache/spark/pull/37153 > PrometheusServlet: add "TYPE" comment to exposed metrics > > > Key: SPARK-39619 > URL: https://issues.apache.org/jira/browse/SPARK-39619 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Eric Barault >Priority: Major > > The PrometheusServlet sink does not include the usual comments when exposing > metrics in the Prometheus format, > e.g. `# TYPE nginx_ingress_controller_ingress_upstream_latency_seconds > summary`, > which prevents clients/integrations that depend on these comments to determine > the metric type from working properly. > For example, the AWS CloudWatch agent's Prometheus plugin attempts to get the > metric type from the TYPE comment and considers any metric with no type as > unsupported, and hence drops it. > As a result, the CloudWatch agent drops all the metrics exposed by > the PrometheusServlet. > [https://github.com/aws/amazon-cloudwatch-agent/blob/1f654cf69c1269073673ba2f636738c556248a31/plugins/inputs/prometheus_scraper/metrics_type_handler.go#L190] > > This would be solved by adding the TYPE comments to the metrics exposed by > the PrometheusServlet sink. > > _*references:*_ > - [https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/] > - > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala] > - > [https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
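For reference, the Prometheus text exposition format expects each metric family to be preceded by `# HELP` and `# TYPE` comment lines, which is what the ticket asks the PrometheusServlet to emit. A representative fragment (the metric name and value here are illustrative, not Spark's actual metric names):

```
# HELP app_executor_cpu_time_total Total executor CPU time in nanoseconds.
# TYPE app_executor_cpu_time_total counter
app_executor_cpu_time_total 123456789
```

Scrapers such as the CloudWatch agent read the `# TYPE` line to classify the sample as a counter, gauge, histogram, or summary; without it, the metric is treated as untyped.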
[jira] [Updated] (SPARK-39721) Deprecate databaseName in listColumns if needed
[ https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-39721: -- Summary: Deprecate databaseName in listColumns if needed (was: Add getTable/getDatabase/getFunction in SparkR support 3L namespace) > Deprecate databaseName in listColumns if needed > --- > > Key: SPARK-39721 > URL: https://issues.apache.org/jira/browse/SPARK-39721 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39721) Deprecate databaseName in listColumns in SparkR if needed
[ https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-39721: -- Summary: Deprecate databaseName in listColumns in SparkR if needed (was: Deprecate databaseName in listColumns if needed) > Deprecate databaseName in listColumns in SparkR if needed > - > > Key: SPARK-39721 > URL: https://issues.apache.org/jira/browse/SPARK-39721 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-39721) Deprecate databaseName in listColumns in SparkR if needed
[ https://issues.apache.org/jira/browse/SPARK-39721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reopened SPARK-39721: --- > Deprecate databaseName in listColumns in SparkR if needed > - > > Key: SPARK-39721 > URL: https://issues.apache.org/jira/browse/SPARK-39721 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39742) Request executor after kill executor, the number of executors is not as expected
zhuml created SPARK-39742: - Summary: Request executor after kill executor, the number of executors is not as expected Key: SPARK-39742 URL: https://issues.apache.org/jira/browse/SPARK-39742 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.2.1 Reporter: zhuml I used the killExecutors and requestExecutors functions of SparkContext to dynamically adjust resources, and found that calling requestExecutors after killExecutors does not produce the expected number of executors. Add a unit test in StandaloneDynamicAllocationSuite.scala: {code:java} test("kill executors first and then request") { sc = new SparkContext(appConf .set(config.EXECUTOR_CORES, 2) .set(config.CORES_MAX, 8)) val appId = sc.applicationId eventually(timeout(10.seconds), interval(10.millis)) { val apps = getApplications() assert(apps.size === 1) assert(apps.head.id === appId) assert(apps.head.executors.size === 4) // 8 cores total assert(apps.head.getExecutorLimit === Int.MaxValue) } // sync executors between the Master and the driver, needed because // the driver refuses to kill executors it does not know about syncExecutors(sc) val executors = getExecutorIds(sc) assert(executors.size === 4) // kill 3 executors assert(sc.killExecutors(executors.take(3))) val apps = getApplications() assert(apps.head.executors.size === 1) // request 3 more executors assert(sc.requestExecutors(3)) assert(apps.head.executors.size === 4) } {code} The last assertion fails: 3 did not equal 4 (Expected: 4, Actual: 3) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written
[ https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564878#comment-17564878 ] Apache Spark commented on SPARK-26052: -- User 'danielhaviv' has created a pull request for this issue: https://github.com/apache/spark/pull/37153 > Spark should output a _SUCCESS file for every partition correctly written > - > > Key: SPARK-26052 > URL: https://issues.apache.org/jira/browse/SPARK-26052 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Matt Matolcsi >Priority: Minor > Labels: bulk-closed > > When writing a set of partitioned Parquet files to HDFS using > dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table > after successful completion, though the actual Parquet files will end up in > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > If partitions are written out one at a time (e.g., an hourly ETL), the > _SUCCESS file is overwritten by each subsequent run and information on what > partitions were correctly written is lost. > I would like to be able to keep track of what partitions were successfully > written in HDFS. I think this could be done by writing the _SUCCESS files to > the same partition directories where the Parquet files reside, i.e., > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I > don't think this should break partition discovery. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
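The proposal above can be sketched outside Spark as plain file-system operations; `mark_partition_success` is a hypothetical helper for illustration, not part of any Spark committer API, and the path layout mirrors the `partition_key1=val1/partition_key2=val2` convention from the report:

```python
# Sketch of the proposed behavior: after a partition is written, drop an
# empty _SUCCESS marker inside that partition's directory instead of only
# at the table root. Pure-Python illustration, not Spark committer code.
from pathlib import Path
import tempfile

def mark_partition_success(table_root, **partition_keys):
    """Create an empty _SUCCESS file under the given partition directory."""
    part_dir = Path(table_root).joinpath(
        *(f"{key}={value}" for key, value in partition_keys.items())
    )
    part_dir.mkdir(parents=True, exist_ok=True)
    marker = part_dir / "_SUCCESS"
    marker.touch()
    return marker

# Demo against a throwaway local directory standing in for HDFS:
root = tempfile.mkdtemp()
marker = mark_partition_success(root, partition_key1="val1", partition_key2="val2")
print(marker.name)  # _SUCCESS
```

With this layout, an hourly ETL leaves one marker per partition, so markers from earlier runs are never overwritten.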
[jira] [Commented] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written
[ https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564877#comment-17564877 ] Apache Spark commented on SPARK-26052: -- User 'danielhaviv' has created a pull request for this issue: https://github.com/apache/spark/pull/37153 > Spark should output a _SUCCESS file for every partition correctly written > - > > Key: SPARK-26052 > URL: https://issues.apache.org/jira/browse/SPARK-26052 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Matt Matolcsi >Priority: Minor > Labels: bulk-closed > > When writing a set of partitioned Parquet files to HDFS using > dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table > after successful completion, though the actual Parquet files will end up in > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > If partitions are written out one at a time (e.g., an hourly ETL), the > _SUCCESS file is overwritten by each subsequent run and information on what > partitions were correctly written is lost. > I would like to be able to keep track of what partitions were successfully > written in HDFS. I think this could be done by writing the _SUCCESS files to > the same partition directories where the Parquet files reside, i.e., > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I > don't think this should break partition discovery. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39741) Support url encode/decode as built-in function
[ https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564832#comment-17564832 ] Apache Spark commented on SPARK-39741: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/37113 > Support url encode/decode as built-in function > -- > > Key: SPARK-39741 > URL: https://issues.apache.org/jira/browse/SPARK-39741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Minor > Fix For: 3.4.0 > > > Currently, Spark does not support URL encode/decode as built-in functions; users may resort to reflection instead, which is a hassle, and these functions are often useful. > This PR aims to add url encode/decode built-in function support. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39741) Support url encode/decode as built-in function
[ https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39741: Assignee: (was: Apache Spark) > Support url encode/decode as built-in function > -- > > Key: SPARK-39741 > URL: https://issues.apache.org/jira/browse/SPARK-39741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Minor > Fix For: 3.4.0 > > > Currently, Spark does not support URL encode/decode as built-in functions; users may resort to reflection instead, which is a hassle, and these functions are often useful. > This PR aims to add url encode/decode built-in function support. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39741) Support url encode/decode as built-in function
[ https://issues.apache.org/jira/browse/SPARK-39741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39741: Assignee: Apache Spark > Support url encode/decode as built-in function > -- > > Key: SPARK-39741 > URL: https://issues.apache.org/jira/browse/SPARK-39741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Apache Spark >Priority: Minor > Fix For: 3.4.0 > > > Currently, Spark does not support URL encode/decode as built-in functions; users may resort to reflection instead, which is a hassle, and these functions are often useful. > This PR aims to add url encode/decode built-in function support. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39741) Support url encode/decode as built-in function
Yi kaifei created SPARK-39741: - Summary: Support url encode/decode as built-in function Key: SPARK-39741 URL: https://issues.apache.org/jira/browse/SPARK-39741 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Yi kaifei Fix For: 3.4.0 Currently, Spark does not support URL encode/decode as built-in functions; users may resort to reflection instead, which is a hassle, and these functions are often useful. This PR aims to add url encode/decode built-in function support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
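For reference, the intended semantics presumably follow Java's URLEncoder/URLDecoder; Python's urllib.parse can stand in to illustrate the behavior. This is only an illustration of the encode/decode contract, not the Spark implementation, and JVM edge cases may differ:

```python
# Illustration of url-encode/decode semantics using Python's urllib.parse
# as a stand-in; Spark's built-ins would live on the JVM
# (java.net.URLEncoder / URLDecoder), so edge cases may differ slightly.
from urllib.parse import quote_plus, unquote_plus

encoded = quote_plus("https://spark.apache.org?a=1 b=2")
decoded = unquote_plus(encoded)

print(encoded)  # https%3A%2F%2Fspark.apache.org%3Fa%3D1+b%3D2
print(decoded)  # https://spark.apache.org?a=1 b=2
```

Having these as built-ins avoids the `reflect("java.net.URLEncoder", "encode", ...)` workaround the description alludes to.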
[jira] [Assigned] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's
[ https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39728: Assignee: Andrew Ray > Test for parity of SQL functions between Python and JVM DataFrame API's > --- > > Key: SPARK-39728 > URL: https://issues.apache.org/jira/browse/SPARK-39728 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > > Add a unit test that compares the available list of Python DataFrame > functions in pyspark.sql.functions with those available in the Scala/Java > DataFrame API in org.apache.spark.sql.functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's
[ https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39728. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37144 [https://github.com/apache/spark/pull/37144] > Test for parity of SQL functions between Python and JVM DataFrame API's > --- > > Key: SPARK-39728 > URL: https://issues.apache.org/jira/browse/SPARK-39728 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > Fix For: 3.4.0 > > > Add a unit test that compares the available list of Python DataFrame > functions in pyspark.sql.functions with those available in the Scala/Java > DataFrame API in org.apache.spark.sql.functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
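The parity idea above can be sketched as a name-set diff. The real test in pull request 37144 introspects org.apache.spark.sql.functions through the JVM gateway and compares it with pyspark.sql.functions; here Python's math module and a hardcoded reference set stand in, purely for illustration:

```python
# Sketch of an API-parity check: collect the public callables of one
# module and diff them against a reference set of expected names.
# math and the hardcoded set below are stand-ins, not the real APIs.
import inspect
import math

def public_functions(module):
    """Public callable names defined in a module."""
    return {
        name
        for name, member in inspect.getmembers(module, callable)
        if not name.startswith("_")
    }

reference = {"sqrt", "floor", "ceil", "log"}   # stand-in for the JVM side
missing = reference - public_functions(math)
assert not missing, f"missing from the Python side: {missing}"
```

The same shape works for any two API surfaces, with an allowlist for intentional differences.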
[jira] [Resolved] (SPARK-39735) Enable base image build in lint job and fix sparkr env
[ https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39735. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37149 [https://github.com/apache/spark/pull/37149] > Enable base image build in lint job and fix sparkr env > -- > > Key: SPARK-39735 > URL: https://issues.apache.org/jira/browse/SPARK-39735 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > Since R 4.2.x has the [below new > change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376): > > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the > R system default if unset or empty, and can be set to NULL to indicate an > empty list of user or site library directories. > The latest Ubuntu package also syncs this change and modifies > /etc/R/Renviron: > ``` > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron | grep R_LIBS_SITE > R_LIBS_SITE=${R_LIBS_SITE:-'%S'} > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron.site | grep R_LIBS_SITE > ## edd Jul 2007 Now use R_LIBS_SITE, not R_LIBS > ## edd Mar 2022 Now in Renviron.site reflecting R_LIBS_SITE > R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library" > ``` > So, we add `R_LIBS_SITE` to the environment from `/etc/R/Renviron.site` to make sure > the search paths are right for SparkR. 
> Otherwise, even after installing `lintr`, errors like the following occur because > `R_LIBS_SITE` is set incorrectly: > ``` > $ dev/lint-r > Loading required namespace: SparkR > Loading required namespace: lintr > Failed with error: 'there is no package called 'lintr'' > Installing package into '/usr/lib/R/site-library' > (as 'lib' is unspecified) > Error in contrib.url(repos, type) : > trying to use CRAN without setting a mirror > Calls: install.packages -> startsWith -> contrib.url > Execution halted > ``` > [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html > [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
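The fix described above amounts to exporting the site library path before running the lint job. A minimal sketch, assuming the Ubuntu default paths quoted in the issue (the actual change lives in the CI image/workflow):

```shell
# Sketch: set R_LIBS_SITE explicitly, mirroring /etc/R/Renviron.site on
# Ubuntu, so SparkR's library search paths resolve before dev/lint-r runs.
# The paths below are the Ubuntu defaults from the report; adjust as needed.
R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library"
export R_LIBS_SITE
echo "$R_LIBS_SITE"
```

With this in place, `install.packages('lintr')` lands in a directory that `library(lintr)` can actually find.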
[jira] [Assigned] (SPARK-39735) Enable base image build in lint job and fix sparkr env
[ https://issues.apache.org/jira/browse/SPARK-39735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39735: Assignee: Yikun Jiang > Enable base image build in lint job and fix sparkr env > -- > > Key: SPARK-39735 > URL: https://issues.apache.org/jira/browse/SPARK-39735 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > Since R 4.2.x has the [below new > change](https://github.com/r-devel/r-svn/blob/e6be1f6b14838016e78d6a91f48f21acec7fa4c4/doc/NEWS.Rd#L376): > > Environment variables R_LIBS_USER and R_LIBS_SITE are both now set to the > R system default if unset or empty, and can be set to NULL to indicate an > empty list of user or site library directories. > The latest Ubuntu package also syncs this change and modifies > /etc/R/Renviron: > ``` > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron | grep R_LIBS_SITE > R_LIBS_SITE=${R_LIBS_SITE:-'%S'} > $ docker run -ti > ghcr.io/yikun/apache-spark-github-action-image:sparkr-master-2569799176 cat > /etc/R/Renviron.site | grep R_LIBS_SITE > ## edd Jul 2007 Now use R_LIBS_SITE, not R_LIBS > ## edd Mar 2022 Now in Renviron.site reflecting R_LIBS_SITE > R_LIBS_SITE="/usr/local/lib/R/site-library/:${R_LIBS_SITE}:/usr/lib/R/library" > ``` > So, we add `R_LIBS_SITE` to the environment from `/etc/R/Renviron.site` to make sure > the search paths are right for SparkR. 
> Otherwise, even after installing `lintr`, errors like the following occur because > `R_LIBS_SITE` is set incorrectly: > ``` > $ dev/lint-r > Loading required namespace: SparkR > Loading required namespace: lintr > Failed with error: 'there is no package called 'lintr'' > Installing package into '/usr/lib/R/site-library' > (as 'lib' is unspecified) > Error in contrib.url(repos, type) : > trying to use CRAN without setting a mirror > Calls: install.packages -> startsWith -> contrib.url > Execution halted > ``` > [1] https://cran.r-project.org/doc/manuals/r-devel/NEWS.html > [2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39740) vis-timeline @ 4.2.1 vulnerable to XSS attacks
Eugene Shinn (Truveta) created SPARK-39740: -- Summary: vis-timeline @ 4.2.1 vulnerable to XSS attacks Key: SPARK-39740 URL: https://issues.apache.org/jira/browse/SPARK-39740 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.3.0, 3.2.1 Reporter: Eugene Shinn (Truveta) The Spark UI includes the visjs/vis-timeline package @ 4.2.1, which is vulnerable to XSS attacks ([Cross-site Scripting in vis-timeline · CVE-2020-28487 · GitHub Advisory Database|https://github.com/advisories/GHSA-9mrv-456v-pf22]). This version should be replaced with the next non-vulnerable release - [Release v7.4.4 · visjs/vis-timeline (github.com)|https://github.com/visjs/vis-timeline/releases/tag/v7.4.4] or higher. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org