[jira] [Updated] (SPARK-48420) Upgrade netty to `4.1.110.Final`
[ https://issues.apache.org/jira/browse/SPARK-48420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48420: --- Labels: pull-request-available (was: ) > Upgrade netty to `4.1.110.Final` > > > Key: SPARK-48420 > URL: https://issues.apache.org/jira/browse/SPARK-48420 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48420) Upgrade netty to `4.1.110.Final`
BingKun Pan created SPARK-48420: --- Summary: Upgrade netty to `4.1.110.Final` Key: SPARK-48420 URL: https://issues.apache.org/jira/browse/SPARK-48420 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-48419) Foldable propagation replace foldable column should use origin column
[ https://issues.apache.org/jira/browse/SPARK-48419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48419: --- Labels: pull-request-available (was: ) > Foldable propagation replace foldable column should use origin column > - > > Key: SPARK-48419 > URL: https://issues.apache.org/jira/browse/SPARK-48419 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4 >Reporter: KnightChess >Priority: Major > Labels: pull-request-available > > the column name will be changed by `FoldablePropagation` in the optimizer > before the optimizer: > ```shell > 'Project ['x, 'y, 'z] > +- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114] > +- LocalRelation , [a#0, b#1] > ``` > after the optimizer: > ```shell > Project [x#112, str AS Y#113, z#114] > +- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114] > +- LocalRelation , [a#0, b#1] > ``` > the column name `y` will be replaced with 'Y'
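[Editorial note] The substitution the ticket describes can be illustrated with a pure-Python toy (this is not Spark's actual code; names like `propagate_foldables` are illustrative). The optimizer inlines a foldable (constant) alias from the inner projection into the outer one; the fix requested here is to keep the *outer* output name when substituting, instead of adopting the inner alias's name (`y` vs. `Y` in the plans above):

```python
def propagate_foldables(outer_names, inner_cols, resolve=str.lower):
    """outer_names: columns the outer Project selects, e.g. ['x', 'y', 'z'].
    inner_cols: mapping of inner output name -> expr; a foldable constant is
    represented as ('lit', value). Name resolution is case-insensitive,
    mirroring Spark's default resolver."""
    foldables = {resolve(name): expr for name, expr in inner_cols.items()
                 if isinstance(expr, tuple) and expr[0] == "lit"}
    result = []
    for name in outer_names:
        expr = foldables.get(resolve(name))
        if expr is not None:
            # Alias the constant with the OUTER name, so the output schema
            # (x, y, z) is preserved even though the inner alias was 'Y'.
            result.append((name, expr))
        else:
            result.append((name, name))  # plain attribute reference
    return result

cols = propagate_foldables(["x", "y", "z"],
                           {"x": "a", "Y": ("lit", "str"), "z": "b"})
```

With the fix, the output names stay `x, y, z` while the constant is still inlined.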
[jira] [Updated] (SPARK-48419) Foldable propagation replace foldable column should use origin column
[ https://issues.apache.org/jira/browse/SPARK-48419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KnightChess updated SPARK-48419: Summary: Foldable propagation replace foldable column should use origin column (was: Foldable propagation change output schema) > Foldable propagation replace foldable column should use origin column > - > > Key: SPARK-48419 > URL: https://issues.apache.org/jira/browse/SPARK-48419 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4 >Reporter: KnightChess >Priority: Major > > the column name will be changed by `FoldablePropagation` in the optimizer > before the optimizer: > ```shell > 'Project ['x, 'y, 'z] > +- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114] > +- LocalRelation , [a#0, b#1] > ``` > after the optimizer: > ```shell > Project [x#112, str AS Y#113, z#114] > +- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114] > +- LocalRelation , [a#0, b#1] > ``` > the column name `y` will be replaced with 'Y'
[jira] [Created] (SPARK-48419) Foldable propagation change output schema
KnightChess created SPARK-48419: --- Summary: Foldable propagation change output schema Key: SPARK-48419 URL: https://issues.apache.org/jira/browse/SPARK-48419 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.4, 3.5.1, 3.2.4, 3.1.3, 3.0.3, 4.0.0 Reporter: KnightChess the column name will be changed by `FoldablePropagation` in the optimizer before the optimizer: ```shell 'Project ['x, 'y, 'z] +- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114] +- LocalRelation , [a#0, b#1] ``` after the optimizer: ```shell Project [x#112, str AS Y#113, z#114] +- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114] +- LocalRelation , [a#0, b#1] ``` the column name `y` will be replaced with 'Y'
[jira] [Updated] (SPARK-48418) Spark structured streaming: Add microbatch timestamp to foreachBatch method
[ https://issues.apache.org/jira/browse/SPARK-48418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anil Dasari updated SPARK-48418: Description: We are on Spark 3.x, using Spark dstream + Kafka, and planning to move to structured streaming + Kafka. Differences between the Dstream microbatch and structured streaming microbatch metadata are making migration difficult. Dstream#foreachRDD gives both the microbatch RDD and its start timestamp (as a long). However, structured streaming Dataset#foreachBatch returns only the microbatch dataset and the batch ID, which is a numeric value. The microbatch start time is used across our data pipelines and in the final result. Could you add the microbatch start timestamp to the Dataset#foreachBatch method? Pseudo code: {code:java} val inputStream = sparkSession.readStream.format("rate").load inputStream .writeStream .trigger(Trigger.ProcessingTime(10 * 1000)) .foreachBatch { (ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is microbatch triggered/start timestamp // application logic. } .start() .awaitTermination() {code} Implementation approach, where batchTime is the time the trigger executor executed (`currentTriggerStartTimestamp` could be used as the batch time as well; the trigger executor time is the source of the microbatch and can also easily be added to the query processor event): 1. Add the trigger time to [TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31] {code:java} trait TriggerExecutor { // added batchTime (Long) argument def execute(batchRunner: (MicroBatchExecutionContext, Long) => Boolean): Unit ... // (other methods) }{code} 2. Update ProcessingTimeExecutor and the other executors to pass the trigger time. 
{code:java} override def execute(triggerHandler: (MicroBatchExecutionContext, Long) => Boolean): Unit = { while (true) { val triggerTimeMs = clock.getTimeMillis() val nextTriggerTimeMs = nextBatchTime(triggerTimeMs) // pass triggerTimeMs to runOneBatch which invokes triggerHandler and is used in MicroBatchExecution#runActivatedStream method. val terminated = !runOneBatch(triggerHandler, triggerTimeMs) ... } } {code} 3. Add an executionTime (long) argument to the MicroBatchExecution#executeOneBatch method [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L330] 4. Pass the execution time in [runBatch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L380C11-L380C19] and [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L849] 5. Finally, add the following `foreachBatch` method to `DataStreamWriter`, update the existing `foreachBatch` methods for the new argument, and also add it to the query processor event. {code:java} def foreachBatch(function: (Dataset[T], Long, Long) => Unit): DataStreamWriter[T] = { this.source = SOURCE_NAME_FOREACH_BATCH if (function == null) throw new IllegalArgumentException("foreachBatch function cannot be null") this.foreachBatchWriter = function this }{code} Let me know your thoughts. was: We are on Spark 3.x, using Spark dstream + Kafka, and planning to move to structured streaming + Kafka. Differences between the Dstream microbatch and structured streaming microbatch metadata are making migration difficult. Dstream#foreachRDD gives both the microbatch RDD and its start timestamp (as a long). However, structured streaming Dataset#foreachBatch returns only the microbatch dataset and the batch ID, which is a numeric value. The microbatch start time is used across our data pipelines and in the final result. 
Could you add the microbatch start timestamp to the Dataset#foreachBatch method? Pseudo code: {code:java} val inputStream = sparkSession.readStream.format("rate").load inputStream .writeStream .trigger(Trigger.ProcessingTime(10 * 1000)) .foreachBatch { (ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is microbatch triggered/start timestamp // application logic. } .start() .awaitTermination() {code} Implementation approach, where batchTime is the time the trigger executor executed (`currentTriggerStartTimestamp` could be used as the batch time as well; the trigger executor time is the source of the microbatch and can also easily be added to the query processor event): # Add the trigger time to [TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31] {code:java} trait TriggerExecutor { // added batchTime (Long) argument def execute(batchRunner: (MicroBatchExecutionContext, Long) => Boolean): Unit ... // (other methods) }{code} #
[jira] [Created] (SPARK-48418) Spark structured streaming: Add microbatch timestamp to foreachBatch method
Anil Dasari created SPARK-48418: --- Summary: Spark structured streaming: Add microbatch timestamp to foreachBatch method Key: SPARK-48418 URL: https://issues.apache.org/jira/browse/SPARK-48418 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.1 Reporter: Anil Dasari We are on Spark 3.x, using Spark dstream + Kafka, and planning to move to structured streaming + Kafka. Differences between the Dstream microbatch and structured streaming microbatch metadata are making migration difficult. Dstream#foreachRDD gives both the microbatch RDD and its start timestamp (as a long). However, structured streaming Dataset#foreachBatch returns only the microbatch dataset and the batch ID, which is a numeric value. The microbatch start time is used across our data pipelines and in the final result. Could you add the microbatch start timestamp to the Dataset#foreachBatch method? Pseudo code: {code:java} val inputStream = sparkSession.readStream.format("rate").load inputStream .writeStream .trigger(Trigger.ProcessingTime(10 * 1000)) .foreachBatch { (ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is microbatch triggered/start timestamp // application logic. } .start() .awaitTermination() {code} Implementation approach, where batchTime is the time the trigger executor executed (`currentTriggerStartTimestamp` could be used as the batch time as well; the trigger executor time is the source of the microbatch and can also easily be added to the query processor event): # Add the trigger time to [TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31] {code:java} trait TriggerExecutor { // added batchTime (Long) argument def execute(batchRunner: (MicroBatchExecutionContext, Long) => Boolean): Unit ... // (other methods) }{code} # Update ProcessingTimeExecutor and the other executors to pass the trigger time. 
{code:java} override def execute(triggerHandler: (MicroBatchExecutionContext, Long) => Boolean): Unit = { while (true) { val triggerTimeMs = clock.getTimeMillis() val nextTriggerTimeMs = nextBatchTime(triggerTimeMs) // pass triggerTimeMs to runOneBatch which invokes triggerHandler and is used in MicroBatchExecution#runActivatedStream method. val terminated = !runOneBatch(triggerHandler, triggerTimeMs) ... } } {code} # Add an executionTime (long) argument to the MicroBatchExecution#executeOneBatch method [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L330] # Pass the execution time in [runBatch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L380C11-L380C19] and [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L849] # Finally, add the following `foreachBatch` method to `DataStreamWriter`, update the existing `foreachBatch` methods for the new argument, and also add it to the query processor event. {code:java} def foreachBatch(function: (Dataset[T], Long, Long) => Unit): DataStreamWriter[T] = { this.source = SOURCE_NAME_FOREACH_BATCH if (function == null) throw new IllegalArgumentException("foreachBatch function cannot be null") this.foreachBatchWriter = function this }{code}
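[Editorial note] The executor change proposed in step 2 can be sketched in pure Python (a toy model with illustrative names, not Spark source): the trigger executor records the trigger timestamp each iteration and hands it to the batch runner, which is how it would eventually reach foreachBatch as the proposed third argument.

```python
import itertools

class ProcessingTimeExecutorSketch:
    def __init__(self, interval_ms, clock):
        self.interval_ms = interval_ms
        self.clock = clock  # zero-arg callable returning current time in ms

    def next_batch_time(self, now_ms):
        # Align the next trigger to the following interval boundary,
        # as ProcessingTimeExecutor's nextBatchTime does.
        return (now_ms // self.interval_ms + 1) * self.interval_ms

    def execute(self, batch_runner, max_batches=3):
        # The real loop runs until terminated; bounded here so the sketch ends.
        for batch_id in range(max_batches):
            trigger_time_ms = self.clock()  # the new piece of state to plumb through
            if not batch_runner(batch_id, trigger_time_ms):
                break

# A fake clock that advances one 10s interval per call stands in for real time.
ticks = itertools.count(start=1_000, step=10_000)
executor = ProcessingTimeExecutorSketch(10_000, lambda: next(ticks))
seen = []
executor.execute(lambda batch_id, t: seen.append((batch_id, t)) or True)
# seen == [(0, 1000), (1, 11000), (2, 21000)]
```

Each batch runner invocation now carries the trigger timestamp alongside the batch id, which is exactly the pair the proposed `foreachBatch(ds, batchId, batchTime)` signature would expose.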
[jira] [Updated] (SPARK-47041) PushDownUtils uses FileScanBuilder instead of SupportsPushDownCatalystFilters trait
[ https://issues.apache.org/jira/browse/SPARK-47041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47041: --- Labels: pull-request-available (was: ) > PushDownUtils uses FileScanBuilder instead of SupportsPushDownCatalystFilters > trait > --- > > Key: SPARK-47041 > URL: https://issues.apache.org/jira/browse/SPARK-47041 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Никита Соколов >Priority: Major > Labels: pull-request-available > > It could use an existing, more generic interface that looks like it was created > for exactly this purpose, but instead uses a narrower type, forcing you to extend > FileScanBuilder when implementing a ScanBuilder.
[jira] [Updated] (SPARK-48416) Support related nested WITH expression
[ https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48416: --- Labels: pull-request-available (was: ) > Support related nested WITH expression > -- > > Key: SPARK-48416 > URL: https://issues.apache.org/jira/browse/SPARK-48416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Mingliang Zhu >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister
[ https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48394. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46706 [https://github.com/apache/spark/pull/46706] > Cleanup mapIdToMapIndex on mapoutput unregister > --- > > Key: SPARK-48394 > URL: https://issues.apache.org/jira/browse/SPARK-48394 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > There is only one valid mapstatus for the same {{mapIndex}} at the same time > in Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid > chaos.
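[Editorial note] The invariant this ticket enforces can be sketched with a pure-Python toy (hypothetical names, not Spark's ShuffleStatus code): at most one live map status per mapIndex, and the mapId-to-index dict must drop the stale mapId whenever a map output is replaced or unregistered.

```python
class ShuffleStatusSketch:
    def __init__(self, num_partitions):
        self.map_statuses = [None] * num_partitions   # mapIndex -> mapId or None
        self.map_id_to_index = {}                     # mapId -> mapIndex

    def add_map_output(self, map_index, map_id):
        stale = self.map_statuses[map_index]
        if stale is not None:
            # Replacing a status also evicts the old mapId, keeping the
            # reverse mapping consistent with the live statuses.
            self.map_id_to_index.pop(stale, None)
        self.map_statuses[map_index] = map_id
        self.map_id_to_index[map_id] = map_index

    def remove_map_output(self, map_index):
        stale = self.map_statuses[map_index]
        if stale is not None:
            self.map_statuses[map_index] = None
            # The cleanup this ticket adds: unregister drops the dict entry too.
            self.map_id_to_index.pop(stale, None)

status = ShuffleStatusSketch(num_partitions=2)
status.add_map_output(0, map_id=100)
status.add_map_output(0, map_id=200)   # e.g. a speculative re-run replaces mapId 100
status.remove_map_output(0)
```

After the replace-then-unregister sequence, both structures are empty, with no dangling mapId entry left behind.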
[jira] [Assigned] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister
[ https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48394: - Assignee: wuyi > Cleanup mapIdToMapIndex on mapoutput unregister > --- > > Key: SPARK-48394 > URL: https://issues.apache.org/jira/browse/SPARK-48394 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > > There is only one valid mapstatus for the same {{mapIndex}} at the same time > in Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid > chaos.
[jira] [Resolved] (SPARK-48407) Teradata: Document Type Conversion rules between Spark SQL and teradata
[ https://issues.apache.org/jira/browse/SPARK-48407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48407. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46728 [https://github.com/apache/spark/pull/46728] > Teradata: Document Type Conversion rules between Spark SQL and teradata > --- > > Key: SPARK-48407 > URL: https://issues.apache.org/jira/browse/SPARK-48407 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Resolved] (SPARK-48325) Always specify messages in ExecutorRunner.killProcess
[ https://issues.apache.org/jira/browse/SPARK-48325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48325. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46641 [https://github.com/apache/spark/pull/46641] > Always specify messages in ExecutorRunner.killProcess > - > > Key: SPARK-48325 > URL: https://issues.apache.org/jira/browse/SPARK-48325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > For some of the cases in ExecutorRunner.killProcess, the argument `message` > is `None`. We should always specify the message so that we can get the > occurrence rate for different cases, in order to analyze executor running > stability.
[jira] [Commented] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration
[ https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849399#comment-17849399 ] Ravi Dalal commented on SPARK-48417: For anyone facing this issue, use the following configuration to read files from GCS when spark.jars.packages is used: {code:java} config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar") config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"){code} When spark.jars.packages is not used, the following configuration alone works: {code:java} config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar") config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") {code} > Filesystems do not load with spark.jars.packages configuration > -- > > Key: SPARK-48417 > URL: https://issues.apache.org/jira/browse/SPARK-48417 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.1 >Reporter: Ravi Dalal >Priority: Major > Attachments: pyspark_mleap.py, > pyspark_spark_jar_package_config_logs.txt, > pyspark_without_spark_jar_package_config_logs.txt > > > When we use spark.jars.packages configuration parameter in Python > SparkSession Builder (Pyspark), it appears that the filesystems are not > loaded when session starts. Because of this, Spark fails to read file from > Google Cloud Storage (GCS) bucket (with GCS Connector). > I tested this with different packages so it does not appear specific to a > particular package. I will attach the sample code and debug logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
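[Editorial note] In a PySpark SparkSession builder, the settings from the comment would be applied roughly as below. This is an untested sketch assembled from the comment: the jar URL, connector version, and class names are taken from it as-is, and the app name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-example")  # placeholder name
    .config("spark.jars",
            "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Per the comment, this extra setting is needed when
    # spark.jars.packages is also in use:
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .getOrCreate()
)
```

This requires a Spark runtime and network access to the connector jar, so it is shown as a configuration sketch only.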
[jira] [Closed] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration
[ https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Dalal closed SPARK-48417. -- > Filesystems do not load with spark.jars.packages configuration > -- > > Key: SPARK-48417 > URL: https://issues.apache.org/jira/browse/SPARK-48417 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.1 >Reporter: Ravi Dalal >Priority: Major > Attachments: pyspark_mleap.py, > pyspark_spark_jar_package_config_logs.txt, > pyspark_without_spark_jar_package_config_logs.txt > > > When we use spark.jars.packages configuration parameter in Python > SparkSession Builder (Pyspark), it appears that the filesystems are not > loaded when session starts. Because of this, Spark fails to read file from > Google Cloud Storage (GCS) bucket (with GCS Connector). > I tested this with different packages so it does not appear specific to a > particular package. I will attach the sample code and debug logs.
[jira] [Resolved] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration
[ https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Dalal resolved SPARK-48417. Resolution: Not A Problem Apologies. We missed a configuration parameter. Found it after creating this bug. Resolving the bug now. > Filesystems do not load with spark.jars.packages configuration > -- > > Key: SPARK-48417 > URL: https://issues.apache.org/jira/browse/SPARK-48417 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.1 >Reporter: Ravi Dalal >Priority: Major > Attachments: pyspark_mleap.py, > pyspark_spark_jar_package_config_logs.txt, > pyspark_without_spark_jar_package_config_logs.txt > > > When we use spark.jars.packages configuration parameter in Python > SparkSession Builder (Pyspark), it appears that the filesystems are not > loaded when session starts. Because of this, Spark fails to read file from > Google Cloud Storage (GCS) bucket (with GCS Connector). > I tested this with different packages so it does not appear specific to a > particular package. I will attach the sample code and debug logs.
[jira] [Updated] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration
[ https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Dalal updated SPARK-48417: --- Attachment: pyspark_mleap.py pyspark_spark_jar_package_config_logs.txt pyspark_without_spark_jar_package_config_logs.txt > Filesystems do not load with spark.jars.packages configuration > -- > > Key: SPARK-48417 > URL: https://issues.apache.org/jira/browse/SPARK-48417 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.1 >Reporter: Ravi Dalal >Priority: Major > Attachments: pyspark_mleap.py, > pyspark_spark_jar_package_config_logs.txt, > pyspark_without_spark_jar_package_config_logs.txt > > > When we use spark.jars.packages configuration parameter in Python > SparkSession Builder (Pyspark), it appears that the filesystems are not > loaded when session starts. Because of this, Spark fails to read file from > Google Cloud Storage (GCS) bucket (with GCS Connector). > I tested this with different packages so it does not appear specific to a > particular package. I will attach the sample code and debug logs.
[jira] [Created] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration
Ravi Dalal created SPARK-48417: -- Summary: Filesystems do not load with spark.jars.packages configuration Key: SPARK-48417 URL: https://issues.apache.org/jira/browse/SPARK-48417 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.5.1 Reporter: Ravi Dalal When we use spark.jars.packages configuration parameter in Python SparkSession Builder (Pyspark), it appears that the filesystems are not loaded when session starts. Because of this, Spark fails to read file from Google Cloud Storage (GCS) bucket (with GCS Connector). I tested this with different packages so it does not appear specific to a particular package. I will attach the sample code and debug logs.
[jira] [Updated] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48411: --- Labels: pull-request-available (was: ) > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > > Currently we do not have an e2e test for DropDuplicateWithinWatermark, we > should add one. We can simply take one of the tests written in Scala here (with > the testStream API) and replicate it in Python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ```
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Priority: Major (was: Minor) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if: > 1) The translated filter string for push-down to the HMS layer becomes empty, > resulting in all partitions being fetched, and the same table is referenced multiple > times in the query. > 2) Or the same table is referenced multiple times in the query with > different partition filters. > In such cases the current code results in multiple calls to the HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter (filter1 || filter2), getting a base > PrunedInMemoryFileIndex which can become the basis for each of the specific > tables. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183]
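[Editorial note] The grouping idea from the ticket can be sketched in pure Python (illustrative names, not Spark's optimizer code): collect every (index, partition filter) pair referencing the same CatalogFileIndex and plan one HMS listing with the filters OR-ed together, instead of one listing per table reference.

```python
from collections import defaultdict

def plan_partition_fetches(references):
    """references: iterable of (catalog_file_index_id, filter_or_None) pairs,
    one per table reference in the query. Returns one combined filter per
    index; None means 'fetch all partitions' (e.g. when some filter could
    not be translated for HMS push-down)."""
    grouped = defaultdict(list)
    for index_id, flt in references:
        grouped[index_id].append(flt)
    fetches = {}
    for index_id, filters in grouped.items():
        if any(f is None for f in filters):
            # An untranslatable filter forces a full listing, but one full
            # listing can still serve every reference to this index.
            fetches[index_id] = None
        else:
            fetches[index_id] = " OR ".join(f"({f})" for f in filters)
    return fetches
```

For example, two references to the same table with filters `dt='2024-01-01'` and `dt='2024-01-02'` collapse into a single fetch with the combined filter, and each reference then prunes its own subset from the shared result.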
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849356#comment-17849356 ] Yuchen Liu commented on SPARK-48411: I will work on this. > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > Currently we do not have a e2e test for DropDuplicateWithinWatermark, we > should add one. We can simply use one of the test written in Scala here (with > the testStream API) and replicate it to python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ```
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849348#comment-17849348 ] Wei Liu commented on SPARK-48411: - Sorry I tagged the wrong Yuchen > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > Currently we do not have a e2e test for DropDuplicateWithinWatermark, we > should add one. We can simply use one of the test written in Scala here (with > the testStream API) and replicate it to python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ```
[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Milicevic updated SPARK-48338: Attachment: [Design Doc] Sql Scripting - OSS.pdf > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > Design doc for this feature is in attachment. > High level example of Sql Script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48343) Interpreter support
[ https://issues.apache.org/jira/browse/SPARK-48343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Milicevic updated SPARK-48343: Description: Implement interpreter for SQL scripting: * Interpreter * Interpreter testing For more details, design doc can be found in parent Jira item. Update design doc accordingly. was: Implement interpreter for SQL scripting: * Interpreter * Interpreter testing For more details, design doc can be found in parent Jira item. > Interpreter support > --- > > Key: SPARK-48343 > URL: https://issues.apache.org/jira/browse/SPARK-48343 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > Implement interpreter for SQL scripting: > * Interpreter > * Interpreter testing > For more details, design doc can be found in parent Jira item. > Update design doc accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48416) Support related nested WITH expression
Mingliang Zhu created SPARK-48416: - Summary: Support related nested WITH expression Key: SPARK-48416 URL: https://issues.apache.org/jira/browse/SPARK-48416 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Mingliang Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48388) Fix SET behavior for scripts
[ https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849301#comment-17849301 ] David Milicevic commented on SPARK-48388: - The changes are already prepared; I will work on this as soon as SPARK-48342 is completed. > Fix SET behavior for scripts > > > Key: SPARK-48388 > URL: https://issues.apache.org/jira/browse/SPARK-48388 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > By the standard, SET is used to set variable values in SQL scripts. > On our end, SET is configured to work with some Hive configs, so the grammar > is a bit messed up; for that reason it was decided to use SET VAR instead > of SET to work with SQL variables. > This is not standard, and we should figure out a way to be able to use > SET for SQL variables and forbid setting Hive configs from SQL scripts. > > For more details, the design doc can be found in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48415) TypeName support parameterized datatypes
[ https://issues.apache.org/jira/browse/SPARK-48415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48415: --- Labels: pull-request-available (was: ) > TypeName support parameterized datatypes > > > Key: SPARK-48415 > URL: https://issues.apache.org/jira/browse/SPARK-48415 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48415) TypeName support parameterized datatypes
Ruifeng Zheng created SPARK-48415: - Summary: TypeName support parameterized datatypes Key: SPARK-48415 URL: https://issues.apache.org/jira/browse/SPARK-48415 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849267#comment-17849267 ] Abhishek Singh commented on SPARK-21720: Hi, I'm using Spark version 2.4 and I'm getting the same issue. {code:java} operation match { case "equals" => val joinDf = filterDf.select(lower(col("field_value")).alias("field_value")).distinct() excludeDf = excludeDf.join(broadcast(joinDf), colLower === joinDf("field_value"), "left_anti") case "contains" => values.foreach { value => excludeDf = excludeDf.filter(!colLower.contains(value)) } } {code} I'm using this code to generate a filter condition on around 110 distinct values. The error I'm getting is {panel:title=Error log} glue.ProcessLauncher (Logging.scala:logError(70)): Exception in User Class: java.lang.StackOverflowError org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:395) org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557) org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557) org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557) org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557){panel} > Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > When trying to filter a dataset with many predicate conditions, via both Spark > SQL and the Dataset filter transformation as described below, Spark throws a > stack overflow exception > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", > buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = > "" and BLANK_5 = "" and Field7 = "" and Field8 = "" and Field9 = "" and >
Field10 = "" and Field11 = "" and Field12 = "" and Field13 = "" and > Field14 = "" and Field15 = "" and Field16 = "" and Field17 = "" and > Field18 = "" and Field19 = "" and Field20 = "" and Field21 = "" and > Field22 = "" and Field23 = "" and Field24 = "" and Field25 = "" and > Field26 = "" and Field27 = "" and Field28 = "" and Field29 = "" and > Field30 = "" and Field31 = "" and Field32 = "" and Field33 = "" and > Field34 = "" and Field35 = "" and Field36 = "" and Field37 = "" and > Field38 = "" and Field39 = "" and Field40 = "" and Field41 = "" and > Field42 = "" and Field43 = "" and Field44 = "" and Field45 = "" and > Field46 = "" and Field47 = "" and Field48 = "" and Field49 = "" and > Field50 = "" and Field51 = "" and Field52 = "" and Field53 = "" and > Field54 = "" and Field55 = "" and Field56 = "" and Field57 = "" and > Field58 = "" and Field59 = "" and Field60 = "" and Field61 = "" and > Field62 = "" and Field63 = "" and Field64 = "" and Field65 = "" and > Field66 = "" and Field67 = "" and Field68 = "" and Field69 = "" and > Field70 = "" and Field71 = "" and Field72 = "" and Field73 = "" and > Field74 = "" and Field75 = "" and Field76 = "" and Field77 = "" and > Field78 = "" and Field79 = "" and Field80 = "" and Field81 = "" and > Field82 = "" and Field83 = "" and Field84 = "" and Field85 = "" and > Field86 = "" and Field87 = "" and Field88 = "" and Field89 = "" and > Field90 = "" and Field91 = "" and Field92 = "" and Field93 = "" and > Field94 = "" and Field95 = "" and Field96 = "" and Field97 = "" and > Field98 = "" and Field99 = "" and Field100 = "" and Field101 = "" and > Field102 = "" and Field103 = "" and Field104 = "" and Field105 = "" and > Field106 = "" and Field107 = "" and Field108 = "" and Field109 = "" and > Field110 = "" and Field111 = "" and Field112 = "" and Field113 = "" and > Field114 = "" and Field115 = "" and Field116 = "" and Field117 = "" and > Field118 = "" and Field119 = "" and Field120 = "" and Field121 = "" and > Field122 = "" and 
Field123 = "" and Field124 = "" and Field125 = "" and > Field126 = "" and Field127 = "" and Field128 = "" and Field129 = "" and > Field130 = "" and Field131 = "" and Field132 = "" and Field133 = "" and > Field134 = "" and Field135 = "" and Field136 = "" and Field137 = "" and > Field138 = "" and Field139 = "" and Field140 = "" and Field141 = "" and > Field142 = "" and Field143 = "" and Field144 = "" and
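For context on why a predicate like the one quoted above overflows the stack: each additional `and` nests one more level into the expression tree, so any recursive traversal of that tree (analysis, optimization, or janino's flow analysis over the generated code, as in the log above) descends roughly one stack frame per condition. Below is a minimal, Spark-independent sketch (the `And` node and helper names are illustrative, not Spark internals) showing that left-folding 144 conditions yields a tree whose depth grows linearly, while combining the same conditions pairwise keeps the depth logarithmic. This is why workarounds such as filtering in chunks or balancing the predicate tend to help:

```python
from functools import reduce

class And:
    """Toy binary AND node; a left-fold of these mimics a long `a and b and c ...` chain."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def depth(node):
    # Iterative depth computation, so measuring the tree cannot itself overflow.
    stack, best = [(node, 1)], 0
    while stack:
        n, d = stack.pop()
        best = max(best, d)
        if isinstance(n, And):
            stack.append((n.left, d + 1))
            stack.append((n.right, d + 1))
    return best

conds = [f'Field{i} = ""' for i in range(1, 145)]  # 144 conditions, as in the report

# Left fold: And(And(And(c1, c2), c3), ...) -> depth grows linearly with the count.
linear = reduce(And, conds)

def balanced(nodes):
    # Combine neighbors pairwise until one root remains -> logarithmic depth.
    while len(nodes) > 1:
        nodes = [And(nodes[i], nodes[i + 1]) if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

tree = balanced(conds)

print(depth(linear))  # 144
print(depth(tree))    # 9
```

In real Spark code a similar effect can be approximated by combining `Column` expressions pairwise, or by filtering in several smaller steps, instead of one long chained condition; the codegen-side fix recorded in this issue itself shipped in 2.2.1/2.3.0, per the Fix Versions above.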
[jira] [Updated] (SPARK-48414) Fix breaking change in python's `fromJson`
[ https://issues.apache.org/jira/browse/SPARK-48414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48414: --- Labels: pull-request-available (was: ) > Fix breaking change in python's `fromJson` > -- > > Key: SPARK-48414 > URL: https://issues.apache.org/jira/browse/SPARK-48414 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48168) Add bitwise shifting operators support
[ https://issues.apache.org/jira/browse/SPARK-48168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-48168. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46440 [https://github.com/apache/spark/pull/46440] > Add bitwise shifting operators support > -- > > Key: SPARK-48168 > URL: https://issues.apache.org/jira/browse/SPARK-48168 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48412) Refactor data type json parse
[ https://issues.apache.org/jira/browse/SPARK-48412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-48412. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46733 [https://github.com/apache/spark/pull/46733] > Refactor data type json parse > - > > Key: SPARK-48412 > URL: https://issues.apache.org/jira/browse/SPARK-48412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version
[ https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48409. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46704 [https://github.com/apache/spark/pull/46704] > Upgrade MySQL & Postgres & mariadb docker image version > --- > > Key: SPARK-48409 > URL: https://issues.apache.org/jira/browse/SPARK-48409 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version
[ https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48409: Assignee: BingKun Pan > Upgrade MySQL & Postgres & mariadb docker image version > --- > > Key: SPARK-48409 > URL: https://issues.apache.org/jira/browse/SPARK-48409 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version
[ https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48409: --- Labels: pull-request-available (was: ) > Upgrade MySQL & Postgres & mariadb docker image version > --- > > Key: SPARK-48409 > URL: https://issues.apache.org/jira/browse/SPARK-48409 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48384) Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`
[ https://issues.apache.org/jira/browse/SPARK-48384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-48384. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46695 [https://github.com/apache/spark/pull/46695] > Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper` > --- > > Key: SPARK-48384 > URL: https://issues.apache.org/jira/browse/SPARK-48384 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48384) Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`
[ https://issues.apache.org/jira/browse/SPARK-48384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-48384: Assignee: BingKun Pan > Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper` > --- > > Key: SPARK-48384 > URL: https://issues.apache.org/jira/browse/SPARK-48384 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation
[ https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48413: --- Labels: pull-request-available (was: ) > ALTER COLUMN with collation > --- > > Key: SPARK-48413 > URL: https://issues.apache.org/jira/browse/SPARK-48413 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Labels: pull-request-available > > Add support for changing collation of a column with ALTER COLUMN command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48412) Refactor data type json parse
[ https://issues.apache.org/jira/browse/SPARK-48412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48412: --- Labels: pull-request-available (was: ) > Refactor data type json parse > - > > Key: SPARK-48412 > URL: https://issues.apache.org/jira/browse/SPARK-48412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation
[ https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-48413: -- Epic Link: SPARK-46830 > ALTER COLUMN with collation > --- > > Key: SPARK-48413 > URL: https://issues.apache.org/jira/browse/SPARK-48413 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > Add support for changing collation of a column with ALTER COLUMN command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48413) ALTER COLUMN with collation
Nikola Mandic created SPARK-48413: - Summary: ALTER COLUMN with collation Key: SPARK-48413 URL: https://issues.apache.org/jira/browse/SPARK-48413 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic Add support for changing collation of a column with ALTER COLUMN command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48412) Refactor data type json parse
Ruifeng Zheng created SPARK-48412: - Summary: Refactor data type json parse Key: SPARK-48412 URL: https://issues.apache.org/jira/browse/SPARK-48412 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48410) Fix InitCap expression
[ https://issues.apache.org/jira/browse/SPARK-48410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48410: --- Labels: pull-request-available (was: ) > Fix InitCap expression > -- > > Key: SPARK-48410 > URL: https://issues.apache.org/jira/browse/SPARK-48410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
Wei Liu created SPARK-48411: --- Summary: Add E2E test for DropDuplicateWithinWatermark Key: SPARK-48411 URL: https://issues.apache.org/jira/browse/SPARK-48411 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu Currently we do not have an e2e test for DropDuplicateWithinWatermark, so we should add one. We can simply take one of the tests written in Scala here (with the testStream API) and replicate it in Python: [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] The change should happen in [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] so we can test it in both connect and non-connect mode. Test with: ``` python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming python/run-tests --testnames pyspark.sql.tests.connect.streaming.test_parity_streaming ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849202#comment-17849202 ] Wei Liu commented on SPARK-48411: - [~liuyuchen777] is going to work on this -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47257) Assign error classes to ALTER COLUMN errors
[ https://issues.apache.org/jira/browse/SPARK-47257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47257: --- Labels: pull-request-available starter (was: starter) > Assign error classes to ALTER COLUMN errors > --- > > Key: SPARK-47257 > URL: https://issues.apache.org/jira/browse/SPARK-47257 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_105[3-4]* > defined in {*}core/src/main/resources/error/error-classes.json{*}. The name > should be short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48403) Fix Upper & Lower expressions for UTF8_BINARY_LCASE
[ https://issues.apache.org/jira/browse/SPARK-48403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48403: - Summary: Fix Upper & Lower expressions for UTF8_BINARY_LCASE (was: Fix Upper & Lower expressions for UTF8_BINARY_LCASE)) > Fix Upper & Lower expressions for UTF8_BINARY_LCASE > --- > > Key: SPARK-48403 > URL: https://issues.apache.org/jira/browse/SPARK-48403 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48410) Fix InitCap expression
Uroš Bojanić created SPARK-48410: Summary: Fix InitCap expression Key: SPARK-48410 URL: https://issues.apache.org/jira/browse/SPARK-48410 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48403) Fix Upper & Lower expressions for UTF8_BINARY_LCASE)
[ https://issues.apache.org/jira/browse/SPARK-48403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48403: - Summary: Fix Upper & Lower expressions for UTF8_BINARY_LCASE) (was: Fix Upper, Lower, InitCap for UTF8_BINARY_LCASE) > Fix Upper & Lower expressions for UTF8_BINARY_LCASE) > > > Key: SPARK-48403 > URL: https://issues.apache.org/jira/browse/SPARK-48403 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version
BingKun Pan created SPARK-48409: --- Summary: Upgrade MySQL & Postgres & mariadb docker image version Key: SPARK-48409 URL: https://issues.apache.org/jira/browse/SPARK-48409 Project: Spark Issue Type: Improvement Components: Build, Tests Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46090) Support plan fragment level SQL configs in AQE
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-46090: - Assignee: XiDuo You > Support plan fragment level SQL configs in AQE > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46090) Support plan fragment level SQL configs in AQE
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-46090. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44013 [https://github.com/apache/spark/pull/44013] > Support plan fragment level SQL configs in AQE > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48406) Upgrade commons-cli to 1.8.0
[ https://issues.apache.org/jira/browse/SPARK-48406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48406. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46727 [https://github.com/apache/spark/pull/46727] > Upgrade commons-cli to 1.8.0 > > > Key: SPARK-48406 > URL: https://issues.apache.org/jira/browse/SPARK-48406 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > * [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0] > * > [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48406) Upgrade commons-cli to 1.8.0
[ https://issues.apache.org/jira/browse/SPARK-48406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48406: Assignee: Yang Jie > Upgrade commons-cli to 1.8.0 > > > Key: SPARK-48406 > URL: https://issues.apache.org/jira/browse/SPARK-48406 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > * [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0] > * > [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org