[jira] [Resolved] (SPARK-48305) CurrentLike - Database/Schema, Catalog, User (all collations)

2024-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48305.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46613
[https://github.com/apache/spark/pull/46613]

> CurrentLike - Database/Schema, Catalog, User (all collations)
> -
>
> Key: SPARK-48305
> URL: https://issues.apache.org/jira/browse/SPARK-48305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48311) Nested pythonUDF in groupBy and aggregate results in Binding Exception

2024-05-20 Thread Sumit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Singh updated SPARK-48311:

Description: 
Steps to Reproduce 

1. Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, LongType, TimestampType, StringType
)
from datetime import datetime

# Create (or reuse) the session; the original snippet used `spark` without defining it
spark = SparkSession.builder.getOrCreate()

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}
2. UDF registration 
{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(d):
    return d

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}
3. Adding UDF in grouping and agg
{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
    .select('*', udf1_name(F.col('col2')).alias('col4')) \
    .select('*', udf2_name('col4').alias('col5'))

result = temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6')){code}
4. Result
{code:java}
result.show(5, False) {code}
*We get the following error:*
{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}

  was:
Steps to Reproduce 

1. Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, 
schema)df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}
2. UDF registration 
{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(d):
    return d

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}
3. Adding UDF in grouping and agg
{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
udf2_name('col4').alias('col5')) 

result = 
(temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
4. Result
{code:java}
result.show(5, False) {code}
*We get below error*
{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}


> Nested pythonUDF in groupBy and aggregate results in Binding Exception
> --
>
> Key: SPARK-48311
> URL: https://issues.apache.org/jira/browse/SPARK-48311
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Assigned] (SPARK-48258) Implement DataFrame.checkpoint and DataFrame.localCheckpoint

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48258:
--

Assignee: (was: Apache Spark)

> Implement DataFrame.checkpoint and DataFrame.localCheckpoint
> 
>
> Key: SPARK-48258
> URL: https://issues.apache.org/jira/browse/SPARK-48258
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> We should add DataFrame.checkpoint and DataFrame.localCheckpoint for feature 
> parity.
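For reference, a minimal sketch of the existing classic-PySpark API that the Connect implementation would mirror (the checkpoint directory below is an illustrative placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Classic PySpark API to be mirrored in Spark Connect:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path
reliable = df.checkpoint()               # truncates lineage, writes to the checkpoint dir
local = df.localCheckpoint(eager=True)   # executor-local, faster but less fault tolerant
{code}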






[jira] [Assigned] (SPARK-48307) InlineCTE should keep not-inlined relations in the original WithCTE node

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48307:
--

Assignee: Apache Spark

> InlineCTE should keep not-inlined relations in the original WithCTE node
> 
>
> Key: SPARK-48307
> URL: https://issues.apache.org/jira/browse/SPARK-48307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48242) Upgrade extra-enforcer-rules to 1.8.0

2024-05-20 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48242:


Assignee: BingKun Pan

> Upgrade extra-enforcer-rules to 1.8.0
> -
>
> Key: SPARK-48242
> URL: https://issues.apache.org/jira/browse/SPARK-48242
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48242) Upgrade extra-enforcer-rules to 1.8.0

2024-05-20 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48242.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46538
[https://github.com/apache/spark/pull/46538]

> Upgrade extra-enforcer-rules to 1.8.0
> -
>
> Key: SPARK-48242
> URL: https://issues.apache.org/jira/browse/SPARK-48242
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48338) Sql Scripting support for Spark SQL

2024-05-20 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-48338:


 Summary: Sql Scripting support for Spark SQL
 Key: SPARK-48338
 URL: https://issues.apache.org/jira/browse/SPARK-48338
 Project: Spark
  Issue Type: Epic
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic


Design doc for this feature is in attachment.






[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL

2024-05-20 Thread Aleksandar Tomic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Tomic updated SPARK-48338:
-
Attachment: Sql Scripting - OSS.odt

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
> Attachments: Sql Scripting - OSS.odt
>
>
> Design doc for this feature is in attachment.






[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL

2024-05-20 Thread Aleksandar Tomic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Tomic updated SPARK-48338:
-
Description: 
Design doc for this feature is in attachment.

High level example of Sql Script:

```
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
INSERT INTO tscript VALUES (c);
SET c = c - 1;
  END WHILE;
END
```

High level motivation behind this feature:
SQL Scripting gives customers the ability to develop complex ETL and analysis 
entirely in SQL. Until now, customers have had to write verbose SQL statements 
or combine SQL + Python to efficiently write business logic. Coming from 
another system, customers have to choose whether or not they want to migrate to 
pyspark. Some customers end up not using Spark because of this gap. SQL 
Scripting is a key milestone towards enabling SQL practitioners to write 
sophisticated queries, without the need to use pyspark. Further, SQL Scripting 
is a necessary step towards support for SQL Stored Procedures, and along with 
SQL Variables (released) and Temp Tables (in progress), will allow for more 
seamless data warehouse migrations.


  was:Design doc for this feature is in attachment.


> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
> Attachments: Sql Scripting - OSS.odt
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






[jira] [Created] (SPARK-48339) Revert converting collated strings to regular strings when writing to hive metastore

2024-05-20 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-48339:
-

 Summary: Revert converting collated strings to regular strings 
when writing to hive metastore
 Key: SPARK-48339
 URL: https://issues.apache.org/jira/browse/SPARK-48339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic


No longer needed due to https://github.com/apache/spark/pull/46083






[jira] [Updated] (SPARK-48339) Revert converting collated strings to regular strings when writing to hive metastore

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48339:
---
Labels: pull-request-available  (was: )

> Revert converting collated strings to regular strings when writing to hive 
> metastore
> 
>
> Key: SPARK-48339
> URL: https://issues.apache.org/jira/browse/SPARK-48339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> No longer needed due to https://github.com/apache/spark/pull/46083






[jira] [Updated] (SPARK-48290) AQE not working when joining dataframes with more than 2000 partitions

2024-05-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

André F. updated SPARK-48290:
-
Component/s: SQL

> AQE not working when joining dataframes with more than 2000 partitions
> --
>
> Key: SPARK-48290
> URL: https://issues.apache.org/jira/browse/SPARK-48290
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.3.2, 3.5.1
> Environment: spark-standalone
> spark3.5.1
>Reporter: André F.
>Priority: Major
>
> We are joining 2 large dataframes with a considerable skew on the left side 
> in one specific key (>2000 skew ratio).
> {code:java}
> left side num partitions: 10335
> right side num partitions: 1241
> left side num rows: 20181947343
> right side num rows: 107462219 {code}
> Since we have `spark.sql.adaptive.enabled` set, we expect AQE to act during
> the join, dealing with the skewed partition automatically.
> During the join, we can see the following log indicating that the skew was not
> detected, since the partition statistics look suspiciously equal for the
> min/median/max sizes:
> {code:java}
> OptimizeSkewedJoin: number of skewed partitions: left 0, right 0
>  OptimizeSkewedJoin: 
> Optimizing skewed join.
> Left side partitions size info:
> median size: 780925482, max size: 780925482, min size: 780925482, avg size: 
> 780925482
> Right side partitions size info:
> median size: 3325797, max size: 3325797, min size: 3325797, avg size: 3325797
>{code}
> Looking at this log line and the spark configuration possibilities, our two 
> main hypotheses to work around this behavior and correctly detect the skew 
> were:
>  # Increasing the `minNumPartitionsToHighlyCompress` so that Spark doesn’t 
> convert the statistics into a `CompressedMapStatus` and therefore is able to 
> identify the skewed partition.
>  # Allowing spark to use a `HighlyCompressedMapStatus`, but change other 
> configurations such as `spark.shuffle.accurateBlockThreshold` and 
> `spark.shuffle.accurateBlockSkewedFactor` so that even then the size of the 
> skewed partitions/blocks is accurately registered and consequently used in 
> the optimization.
> We tried different values for `spark.shuffle.accurateBlockThreshold` (even
> absurd ones like 1MB) and nothing seems to work. The statistics indicate that
> the min/median/max are somehow the same and thus the skew is not detected.
> However, when forcibly reducing `spark.sql.shuffle.partitions` to less than
> 2000 partitions, the statistics looked correct and the skewed join optimization
> acted as it should:
> {code:java}
> OptimizeSkewedJoin: number of skewed partitions: left 1, right 0
> OptimizeSkewedJoin: Left side partition 42 (263 GB) is skewed, split it into 
> 337 parts.
> OptimizeSkewedJoin: 
> Optimizing skewed join.
> Left side partitions size info:
> median size: 862803419, max size: 282616632301, min size: 842320875, avg 
> size: 1019367139
> Right side partitions size info:
> median size: 4320067, max size: 4376957, min size: 4248989, avg size: 4319766 
> {code}
> Should we assume that the statistics are becoming corrupted when Spark uses 
> `HighlyCompressedMapStatus`? Should we try another configuration property to 
> try to work around this problem? (Assuming that fine tuning all dataframes in 
> skewed joins in our ETL to have less than 2000 partitions is not an option)
>  
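For reference, a minimal PySpark sketch of the configuration knobs discussed above (the config keys exist in Spark; the values are illustrative placeholders, not recommendations):

{code:python}
from pyspark.sql import SparkSession

# Illustrative values only; per the report above, the accurate-block settings
# alone did not help once map statistics were stored as HighlyCompressedMapStatus.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Record accurate sizes for huge/skewed shuffle blocks instead of averages
    .config("spark.shuffle.accurateBlockThreshold", str(100 * 1024 * 1024))
    .config("spark.shuffle.accurateBlockSkewedFactor", "5.0")
    # Hypothesis 1: raise the threshold (default 2000) so Spark keeps per-block
    # sizes instead of switching to HighlyCompressedMapStatus
    .config("spark.shuffle.minNumPartitionsToHighlyCompress", "20000")
    # Workaround that did produce correct statistics: stay below 2000 partitions
    .config("spark.sql.shuffle.partitions", "1999")
    .getOrCreate()
)
{code}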






[jira] [Updated] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-48340:
--
Description: !image-2024-05-20-18-38-39-769.png|width=746,height=450!  
(was: !image-2024-05-20-18-38-22-486.png|width=378,height=227!)

> Support TimestampNTZ  infer schema miss prefer_timestamp_ntz
> 
>
> Key: SPARK-48340
> URL: https://issues.apache.org/jira/browse/SPARK-48340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2024-05-20-18-38-39-769.png
>
>
> !image-2024-05-20-18-38-39-769.png|width=746,height=450!






[jira] [Created] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread angerszhu (Jira)
angerszhu created SPARK-48340:
-

 Summary: Support TimestampNTZ  infer schema miss 
prefer_timestamp_ntz
 Key: SPARK-48340
 URL: https://issues.apache.org/jira/browse/SPARK-48340
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.5.1, 4.0.0
Reporter: angerszhu
 Attachments: image-2024-05-20-18-38-39-769.png

!image-2024-05-20-18-38-22-486.png|width=378,height=227!






[jira] [Updated] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-48340:
--
Attachment: image-2024-05-20-18-38-39-769.png

> Support TimestampNTZ  infer schema miss prefer_timestamp_ntz
> 
>
> Key: SPARK-48340
> URL: https://issues.apache.org/jira/browse/SPARK-48340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2024-05-20-18-38-39-769.png
>
>
> !image-2024-05-20-18-38-22-486.png|width=378,height=227!






[jira] [Updated] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48340:
---
Labels: pull-request-available  (was: )

> Support TimestampNTZ  infer schema miss prefer_timestamp_ntz
> 
>
> Key: SPARK-48340
> URL: https://issues.apache.org/jira/browse/SPARK-48340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-20-18-38-39-769.png
>
>
> !image-2024-05-20-18-38-39-769.png|width=746,height=450!






[jira] [Resolved] (SPARK-48323) DB2: Map BooleanType to Boolean instead of Char(1)

2024-05-20 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48323.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46637
[https://github.com/apache/spark/pull/46637]

> DB2: Map BooleanType to Boolean instead of Char(1)
> --
>
> Key: SPARK-48323
> URL: https://issues.apache.org/jira/browse/SPARK-48323
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48332) Upgrade `jdbc` related test dependencies

2024-05-20 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48332:


Assignee: BingKun Pan

> Upgrade `jdbc` related test dependencies
> 
>
> Key: SPARK-48332
> URL: https://issues.apache.org/jira/browse/SPARK-48332
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48332) Upgrade `jdbc` related test dependencies

2024-05-20 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48332.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46653
[https://github.com/apache/spark/pull/46653]

> Upgrade `jdbc` related test dependencies
> 
>
> Key: SPARK-48332
> URL: https://issues.apache.org/jira/browse/SPARK-48332
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48341) Allow Spark Connect plugins to use PlanTest in their tests

2024-05-20 Thread Tom van Bussel (Jira)
Tom van Bussel created SPARK-48341:
--

 Summary: Allow Spark Connect plugins to use PlanTest in their tests
 Key: SPARK-48341
 URL: https://issues.apache.org/jira/browse/SPARK-48341
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Tom van Bussel









[jira] [Created] (SPARK-48342) Parser support

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48342:
---

 Summary: Parser support
 Key: SPARK-48342
 URL: https://issues.apache.org/jira/browse/SPARK-48342
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Implement parser support for SQL scripting with all supporting changes for 
upcoming interpreter implementation and future extensions of the parser:
 * Parser
 * Parser testing
 * Support for SQL scripting exceptions.

 

For more details, design doc can be found in parent Jira item.






[jira] [Updated] (SPARK-48342) Parser support

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48342:

Description: 
Implement parser support for SQL scripting with all supporting changes for the
upcoming interpreter implementation and future extensions of the parser:
 * Parser
 * Parser testing
 * Support for SQL scripting exceptions.

 

For more details, design doc can be found in parent Jira item.

  was:
Implement parser support for SQL scripting with all supporting changes for 
upcoming interpreter implementation and future extensions of the parser:
 * Parser
 * Parser testing
 * Support for SQL scripting exceptions.

 

For more details, design doc can be found in parent Jira item.


> Parser support
> --
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement parser support for SQL scripting with all supporting changes for the
> upcoming interpreter implementation and future extensions of the parser:
>  * Parser
>  * Parser testing
>  * Support for SQL scripting exceptions.
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Created] (SPARK-48343) Interpreter support

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48343:
---

 Summary: Interpreter support
 Key: SPARK-48343
 URL: https://issues.apache.org/jira/browse/SPARK-48343
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Implement interpreter for SQL scripting:
 * Interpreter
 * Interpreter testing






[jira] [Updated] (SPARK-48343) Interpreter support

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48343:

Description: 
Implement interpreter for SQL scripting:
 * Interpreter
 * Interpreter testing

For more details, design doc can be found in parent Jira item.

  was:
Implement interpreter for SQL scripting:
 * Interpreter
 * Interpreter testing


> Interpreter support
> ---
>
> Key: SPARK-48343
> URL: https://issues.apache.org/jira/browse/SPARK-48343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement interpreter for SQL scripting:
>  * Interpreter
>  * Interpreter testing
> For more details, design doc can be found in parent Jira item.






[jira] [Created] (SPARK-48344) Changes to sql() API

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48344:
---

 Summary: Changes to sql() API
 Key: SPARK-48344
 URL: https://issues.apache.org/jira/browse/SPARK-48344
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Implement changes to sql() API to support SQL script execution:
 * SparkSession changes
 * sql() API changes - iterate through the script, but return only the last DataFrame
 * Spark Config flag to enable/disable SQL scripting in sql() API
 * E2E testing

 

For more details, design doc can be found in parent Jira item.
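A hypothetical sketch of how this could look from user code, reusing the script shape from the parent epic; the config key and the trailing SELECT are illustrative assumptions, not the final API:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical flag name for illustration only.
spark.conf.set("spark.sql.scripting.enabled", "true")

# sql() would iterate through the statements in the script and return only the
# DataFrame of the last one (here, the final SELECT).
last_df = spark.sql("""
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO tscript VALUES (c);
    SET c = c - 1;
  END WHILE;
  SELECT * FROM tscript;
END
""")
last_df.show()
{code}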






[jira] [Assigned] (SPARK-48238) Spark fail to start due to class o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter

2024-05-20 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48238:


Assignee: Cheng Pan

> Spark fail to start due to class 
> o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
> ---
>
> Key: SPARK-48238
> URL: https://issues.apache.org/jira/browse/SPARK-48238
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Blocker
>  Labels: pull-request-available
>
> I tested the latest master branch; it failed to start in YARN mode.
> {code:java}
> dev/make-distribution.sh --tgz -Phive,hive-thriftserver,yarn{code}
>  
> {code:java}
> $ bin/spark-sql --master yarn
> WARNING: Using incubator modules: jdk.incubator.vector
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2024-05-10 17:58:17 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2024-05-10 17:58:18 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive} is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2024-05-10 17:58:25 ERROR SparkContext: Error initializing SparkContext.
> org.sparkproject.jetty.util.MultiException: Multiple exceptions
>     at 
> org.sparkproject.jetty.util.MultiException.ifExceptionThrow(MultiException.java:117)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:751)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:902)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:514) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2$adapted(SparkUI.scala:81)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:935) 
> ~[scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1$adapted(SparkUI.scala:79)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.ui.SparkUI.attachAllHandlers(SparkUI.scala:79) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext.$anonfun$new$31(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.SparkContext.$anonfun$new$31$adapted(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.SparkContext.(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2963) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1118)
>  ~[spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.getOrElse(Option.scala:201) [scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1112)
>  [spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
>  [spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHO

[jira] [Resolved] (SPARK-48238) Spark fail to start due to class o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter

2024-05-20 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48238.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46611
[https://github.com/apache/spark/pull/46611]

> Spark fail to start due to class 
> o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
> ---
>
> Key: SPARK-48238
> URL: https://issues.apache.org/jira/browse/SPARK-48238
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I tested the latest master branch; it failed to start in YARN mode.
> {code:java}
> dev/make-distribution.sh --tgz -Phive,hive-thriftserver,yarn{code}
>  
> {code:java}
> $ bin/spark-sql --master yarn
> WARNING: Using incubator modules: jdk.incubator.vector
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2024-05-10 17:58:17 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2024-05-10 17:58:18 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive} is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2024-05-10 17:58:25 ERROR SparkContext: Error initializing SparkContext.
> org.sparkproject.jetty.util.MultiException: Multiple exceptions
>     at 
> org.sparkproject.jetty.util.MultiException.ifExceptionThrow(MultiException.java:117)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:751)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:902)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:514) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2$adapted(SparkUI.scala:81)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:935) 
> ~[scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1$adapted(SparkUI.scala:79)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.ui.SparkUI.attachAllHandlers(SparkUI.scala:79) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext.$anonfun$new$31(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.SparkContext.$anonfun$new$31$adapted(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.SparkContext.(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2963) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1118)
>  ~[spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.getOrElse(Option.scala:201) [scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1112)
>  [spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apach

[jira] [Created] (SPARK-48345) Checks for variable declarations

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48345:
---

 Summary: Checks for variable declarations
 Key: SPARK-48345
 URL: https://issues.apache.org/jira/browse/SPARK-48345
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Add checks to the parser (visitBatchBody() in AstBuilder) for variable
declarations, based on a passed-in flag:
 * Variables can be declared only at the beginning of the compound.
 * Throw an exception when an invalid variable declaration is encountered.

 

For more details, design doc can be found in parent Jira item.
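A hypothetical illustration of the rule above, using the script syntax from the parent epic's example (the exact error behavior is defined by the design doc, not here):

{code:python}
# Accepted: all declarations appear at the beginning of the compound.
ok_script = """
BEGIN
  DECLARE c INT = 10;
  DECLARE d INT = 20;
  SET c = c - 1;
END
"""

# Rejected: a declaration after another statement should raise a parse-time error.
bad_script = """
BEGIN
  DECLARE c INT = 10;
  SET c = c - 1;
  DECLARE d INT = 20;  -- not at the beginning of the compound
END
"""
{code}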

 






[jira] [Created] (SPARK-48346) Support for IF ELSE statements

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48346:
---

 Summary: Support for IF ELSE statements
 Key: SPARK-48346
 URL: https://issues.apache.org/jira/browse/SPARK-48346
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Add support for IF ELSE statements to SQL scripting parser & interpreter:
 * IF
 * IF / ELSE
 * IF / ELSE IF / ELSE

 

For more details, design doc can be found in parent Jira item.
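A hypothetical illustration of the statement shapes listed above, written as SQL script text inside Python strings; the IF/ELSE grammar shown is an assumption based on the parent epic's style, not the final syntax:

{code:python}
# Assumed SQL/PSM-style shapes; the real grammar comes from the design doc.
if_only = """
BEGIN
  IF c > 0 THEN
    INSERT INTO tscript VALUES (c);
  END IF;
END
"""

if_else_if_else = """
BEGIN
  IF c > 10 THEN
    INSERT INTO tscript VALUES (1);
  ELSE IF c > 5 THEN
    INSERT INTO tscript VALUES (2);
  ELSE
    INSERT INTO tscript VALUES (3);
  END IF;
END
"""
{code}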






[jira] [Created] (SPARK-48347) Support for WHILE statements

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48347:
---

 Summary: Support for WHILE statements
 Key: SPARK-48347
 URL: https://issues.apache.org/jira/browse/SPARK-48347
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Add support for WHILE statements to SQL scripting parser & interpreter.

 

For more details, design doc can be found in parent Jira item.






[jira] [Created] (SPARK-48348) Support for BREAK/CONTINUE statements

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48348:
---

 Summary: Support for BREAK/CONTINUE statements
 Key: SPARK-48348
 URL: https://issues.apache.org/jira/browse/SPARK-48348
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Add support for BREAK and CONTINUE statements in WHILE loops to SQL scripting 
parser & interpreter.

 

For more details, design doc can be found in parent Jira item.






[jira] [Created] (SPARK-48349) Support for debugging

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48349:
---

 Summary: Support for debugging
 Key: SPARK-48349
 URL: https://issues.apache.org/jira/browse/SPARK-48349
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


TBD.
Probably to be separated into multiple subtasks.






[jira] [Created] (SPARK-48350) Support for error handling

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48350:
---

 Summary: Support for error handling
 Key: SPARK-48350
 URL: https://issues.apache.org/jira/browse/SPARK-48350
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


In general, add support for SQL scripting related exceptions.
By the time someone starts working on this item, some exception support might 
already exist - check if it needs refactoring.

 

Have in mind that for some (all?) exceptions we might need to know which
line(s) in the script are responsible for it.






[jira] [Created] (SPARK-48351) JDBC Connectors - Add cast suite

2024-05-20 Thread Uros Stankovic (Jira)
Uros Stankovic created SPARK-48351:
--

 Summary: JDBC Connectors - Add cast suite
 Key: SPARK-48351
 URL: https://issues.apache.org/jira/browse/SPARK-48351
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Uros Stankovic









[jira] [Updated] (SPARK-48350) Support for exceptions

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48350:

Summary: Support for exceptions  (was: Support for error handling)

> Support for exceptions
> --
>
> Key: SPARK-48350
> URL: https://issues.apache.org/jira/browse/SPARK-48350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> In general, add support for SQL scripting related exceptions.
> By the time someone starts working on this item, some exception support might 
> already exist - check if it needs refactoring.
>  
> Have in mind that for some (all?) exceptions we might need to know which 
> line(s) in the script are responsible for it.






[jira] [Created] (SPARK-48352) set max file counter through spark conf

2024-05-20 Thread guihuawen (Jira)
guihuawen created SPARK-48352:
-

 Summary: set max file counter through spark conf
 Key: SPARK-48352
 URL: https://issues.apache.org/jira/browse/SPARK-48352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: guihuawen
 Fix For: 4.0.0


Now we can set spark.sql.files.maxRecordsPerFile through the Spark conf.

But MAX_FILE_COUNTER cannot be set; it is currently a fixed default parameter.
It should be configurable through the Spark conf as well.
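A brief sketch for context: the record limit is already configurable at runtime, while the file counter is not (the commented-out key below is a hypothetical placeholder, not an existing or agreed config name):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Already configurable today: cap the number of records written per file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "5000000")

# The max file counter is currently a fixed constant; a future patch would add
# a config for it. Hypothetical placeholder key:
# spark.conf.set("spark.sql.files.maxFileCounter", "100")
{code}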

 

 

 

 

 






[jira] [Updated] (SPARK-48350) Support for exceptions thrown from parser/interpreter

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48350:

Summary: Support for exceptions thrown from parser/interpreter  (was: 
Support for exceptions)

> Support for exceptions thrown from parser/interpreter
> -
>
> Key: SPARK-48350
> URL: https://issues.apache.org/jira/browse/SPARK-48350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> In general, add support for SQL scripting related exceptions.
> By the time someone starts working on this item, some exception support might 
> already exist - check if it needs refactoring.
>  
> Have in mind that for some (all?) exceptions we might need to know which 
> line(s) in the script are responsible for it.






[jira] [Created] (SPARK-48353) Support for TRY/CATCH

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48353:
---

 Summary: Support for TRY/CATCH
 Key: SPARK-48353
 URL: https://issues.apache.org/jira/browse/SPARK-48353
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Details TBD.

 

 






[jira] [Updated] (SPARK-48352) set max file counter through spark conf

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48352:
---
Labels: pull-request-available  (was: )

> set max file counter through spark conf
> ---
>
> Key: SPARK-48352
> URL: https://issues.apache.org/jira/browse/SPARK-48352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Now we can set spark.sql.files.maxRecordsPerFile through the Spark conf.
> But MAX_FILE_COUNTER cannot be set; it is currently a fixed default parameter.
> It should be configurable through the Spark conf as well.






[jira] [Updated] (SPARK-48351) JDBC Connectors - Add cast suite

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48351:
---
Labels: pull-request-available  (was: )

> JDBC Connectors - Add cast suite
> 
>
> Key: SPARK-48351
> URL: https://issues.apache.org/jira/browse/SPARK-48351
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48353) Support for TRY/CATCH

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48353:

Description: 
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]

 

 

  was:
Details TBD.

 

 


> Support for TRY/CATCH
> -
>
> Key: SPARK-48353
> URL: https://issues.apache.org/jira/browse/SPARK-48353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]
>  
>  






[jira] [Updated] (SPARK-48353) Support for TRY/CATCH

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48353:

Description: 
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]

 

 

  was:
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]

 

 


> Support for TRY/CATCH
> -
>
> Key: SPARK-48353
> URL: https://issues.apache.org/jira/browse/SPARK-48353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]
>  
>  






[jira] [Updated] (SPARK-48353) Support for SIGNAL and TRY/CATCH

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48353:

Summary: Support for SIGNAL and TRY/CATCH  (was: Support for TRY/CATCH)

> Support for SIGNAL and TRY/CATCH
> 
>
> Key: SPARK-48353
> URL: https://issues.apache.org/jira/browse/SPARK-48353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]
>  
>  






[jira] [Created] (SPARK-48354) Added function support and testing for filter push down in JDBC connectors

2024-05-20 Thread Stefan Bukorovic (Jira)
Stefan Bukorovic created SPARK-48354:


 Summary: Added function support and testing for filter push down 
in JDBC connectors
 Key: SPARK-48354
 URL: https://issues.apache.org/jira/browse/SPARK-48354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.3
Reporter: Stefan Bukorovic


There is a possibility to add support for pushing down multiple widely used
Spark functions, such as lower or upper (among others), to JDBC data sources.
Also, more integration tests are needed for push down of filters.
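A minimal sketch of the kind of query where such push down would apply, assuming a reachable JDBC source (URL, table, and credentials are placeholders):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for illustration only.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "customers")
    .option("user", "spark")
    .option("password", "secret")
    .load()
)

# With the proposed support, a predicate built from lower()/upper() could be
# pushed to the database (e.g. WHERE LOWER(name) = 'alice') instead of being
# evaluated in Spark after a full scan.
filtered = df.filter(F.lower(F.col("name")) == "alice")
filtered.explain()  # the scan node's pushed filters would reflect the predicate
{code}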






[jira] [Created] (SPARK-48355) Support for CASE

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48355:
---

 Summary: Support for CASE
 Key: SPARK-48355
 URL: https://issues.apache.org/jira/browse/SPARK-48355
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]






[jira] [Created] (SPARK-48356) Support for FOR loop

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48356:
---

 Summary: Support for FOR loop
 Key: SPARK-48356
 URL: https://issues.apache.org/jira/browse/SPARK-48356
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Details TBD.

 

For more details:

Design doc in parent Jira item.
[SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].






[jira] [Updated] (SPARK-48356) Support for FOR loop

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48356:

Description: 
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

 

For more details:

Design doc in parent Jira item.
[SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for FOR loop
> 
>
> Key: SPARK-48356
> URL: https://issues.apache.org/jira/browse/SPARK-48356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48346) Support for IF ELSE statement

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48346:

Summary: Support for IF ELSE statement  (was: Support for IF ELSE 
statements)

> Support for IF ELSE statement
> -
>
> Key: SPARK-48346
> URL: https://issues.apache.org/jira/browse/SPARK-48346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for IF ELSE statements to SQL scripting parser & interpreter:
>  * IF
>  * IF / ELSE
>  * IF / ELSE IF / ELSE
>  
> For more details, design doc can be found in parent Jira item.
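
A purely hypothetical sketch of the three shapes listed above, written as a SQL 
script submitted through PySpark; the concrete Spark syntax is defined in the 
design doc, so the BEGIN/END and IF keywords below are assumptions borrowed from 
common SQL scripting dialects:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: spark.sql() accepts a SQL scripting block once this feature lands.
# The DECLARE / IF / ELSE IF / ELSE / END IF syntax is illustrative only.
script = """
BEGIN
  DECLARE x INT DEFAULT 7;
  IF x > 10 THEN
    SELECT 'large';
  ELSE IF x > 5 THEN
    SELECT 'medium';
  ELSE
    SELECT 'small';
  END IF;
END
"""
spark.sql(script)
{code}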



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48356) Support for FOR statement

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48356:

Summary: Support for FOR statement  (was: Support for FOR loop)

> Support for FOR statement
> -
>
> Key: SPARK-48356
> URL: https://issues.apache.org/jira/browse/SPARK-48356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48355) Support for CASE statement

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48355:

Summary: Support for CASE statement  (was: Support for CASE)

> Support for CASE statement
> --
>
> Key: SPARK-48355
> URL: https://issues.apache.org/jira/browse/SPARK-48355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) Support for BREAK/CONTINUE statement

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Summary: Support for BREAK/CONTINUE statement  (was: Support for 
BREAK/CONTINUE statements)

> Support for BREAK/CONTINUE statement
> 
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for BREAK and CONTINUE statements in WHILE loops to SQL scripting 
> parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.
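
A hypothetical sketch of where BREAK and CONTINUE would sit inside a WHILE loop, 
again phrased as a SQL script passed to spark.sql(); the exact keywords and loop 
syntax are assumptions based on common SQL scripting dialects and may differ from 
the final design:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: spark.sql() accepts a SQL scripting block once this feature lands.
script = """
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 10 DO
    SET i = i + 1;
    IF i % 2 = 0 THEN
      CONTINUE;  -- skip even values of i
    END IF;
    IF i > 7 THEN
      BREAK;     -- leave the loop early
    END IF;
    SELECT i;
  END WHILE;
END
"""
spark.sql(script)
{code}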



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48347) Support for WHILE statement

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48347:

Summary: Support for WHILE statement  (was: Support for WHILE statements)

> Support for WHILE statement
> ---
>
> Key: SPARK-48347
> URL: https://issues.apache.org/jira/browse/SPARK-48347
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for WHILE statements to SQL scripting parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48353) Support for SIGNAL and TRY/CATCH statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48353:

Summary: Support for SIGNAL and TRY/CATCH statements  (was: Support for 
SIGNAL and TRY/CATCH)

> Support for SIGNAL and TRY/CATCH statements
> ---
>
> Key: SPARK-48353
> URL: https://issues.apache.org/jira/browse/SPARK-48353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48357) Support for LEAVE and LOOP statements

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48357:
---

 Summary: Support for LEAVE and LOOP statements
 Key: SPARK-48357
 URL: https://issues.apache.org/jira/browse/SPARK-48357
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LEAVE and LOOP statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Description: 
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

> Support for LEAVE and LOOP statements
> -
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48358) Support for REPEAT statement

2024-05-20 Thread David Milicevic (Jira)
David Milicevic created SPARK-48358:
---

 Summary: Support for REPEAT statement
 Key: SPARK-48358
 URL: https://issues.apache.org/jira/browse/SPARK-48358
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48342) Parser support

2024-05-20 Thread David Milicevic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847892#comment-17847892
 ] 

David Milicevic commented on SPARK-48342:
-

Working on this: https://github.com/apache/spark/pull/46665

> Parser support
> --
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement parser for SQL scripting with all supporting changes for upcoming 
> interpreter implementation and future extensions of the parser:
>  * Parser
>  * Parser testing
>  * Support for SQL scripting exceptions.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) Support for BREAK and CONTINUE statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Summary: Support for BREAK and CONTINUE statements  (was: Support for 
BREAK/CONTINUE statement)

> Support for BREAK and CONTINUE statements
> -
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for BREAK and CONTINUE statements in WHILE loops to SQL scripting 
> parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48359) Built-in functions for Zstd compression and decompression

2024-05-20 Thread Xi Lyu (Jira)
Xi Lyu created SPARK-48359:
--

 Summary: Built-in functions for Zstd compression and decompression
 Key: SPARK-48359
 URL: https://issues.apache.org/jira/browse/SPARK-48359
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Xi Lyu


Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * streaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.
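
A hedged sketch of how the proposed functions could be called from PySpark; the 
function names and signatures come from the proposal above and do not exist in 
current releases, so selectExpr is used instead of dedicated Python wrappers:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Apache Spark " * 10,)], ["text"])

# zstd_compress / zstd_decompress are the *proposed* built-ins from this ticket;
# this only runs on a build that actually includes them.
result = df.selectExpr(
    "zstd_compress(cast(text AS binary), 3) AS compressed",
    "string(zstd_decompress(zstd_compress(cast(text AS binary)))) AS roundtrip",
)
result.show(truncate=False)
{code}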



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48354) Added function support and testing for filter push down in JDBC connectors

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48354:
---
Labels: pull-request-available  (was: )

> Added function support and testing for filter push down in JDBC connectors
> --
>
> Key: SPARK-48354
> URL: https://issues.apache.org/jira/browse/SPARK-48354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Stefan Bukorovic
>Priority: Major
>  Labels: pull-request-available
>
> There is a possibility to add support for push down of multiple widely used 
> Spark functions (such as lower or upper...) to JDBC data sources. 
> Also, more integration tests are needed for push down of filters. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP and LEAVE/ITERATE statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Summary: Support for LOOP and LEAVE/ITERATE statements  (was: Support for 
LEAVE and LOOP statements)

> Support for LOOP and LEAVE/ITERATE statements
> -
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP and LEAVE/ITERATE statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Description: 
Details TBD.

 

ITERATE should be the same as CONTINUE, LEAVE as BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for LOOP and LEAVE/ITERATE statements
> -
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> ITERATE should be the same as CONTINUE, LEAVE as BREAK?
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP and LEAVE/ITERATE statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Description: 
Details TBD.

Maybe split to multiple items?

 

ITERATE should be the same as CONTINUE, LEAVE as BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

 

ITERATE should be the same as CONTINUE, LEAVE as BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for LOOP and LEAVE/ITERATE statements
> -
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
> Maybe split to multiple items?
>  
> ITERATE should be the same as CONTINUE, LEAVE as BREAK?
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP and LEAVE/ITERATE statements

2024-05-20 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Description: 
Details TBD.

Maybe split to multiple items?

 

ITERATE should be the equivalent to CONTINUE, LEAVE to BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

Maybe split to multiple items?

 

ITERATE should be the same as CONTINUE, LEAVE as BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for LOOP and LEAVE/ITERATE statements
> -
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
> Maybe split to multiple items?
>  
> ITERATE should be the equivalent to CONTINUE, LEAVE to BREAK?
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) Improve array_sort documentation for default comparator

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Summary: Improve array_sort documentation for default comparator  (was: 
array_sort and sort_array fail for structs containing any unorderable fields)

> Improve array_sort documentation for default comparator
> ---
>
> Key: SPARK-48275
> URL: https://issues.apache.org/jira/browse/SPARK-48275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Running in Databricks on [Databricks Runtime 
> 15.1|https://docs.databricks.com/en/release-notes/runtime/15.1.html].
>Reporter: Matt Braymer-Hayes
>Priority: Critical
>
> When {{array_sort()}} and {{sort_array()}} are used on arrays of structs, the 
> first field is used to determine ordering. Unfortunately, it appears that the 
> current implementation requires _all_ fields to be orderable. Here's a 
> minimal example:
>  
> {code:java}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> schema = T.StructType([
> T.StructField(
> 'value',
> T.ArrayType(
> T.StructType([
> T.StructField('order', T.IntegerType(), True),
> T.StructField('values', T.MapType(T.StringType(), 
> T.StringType(), True), True), # remove this field and both commands below 
> succeed
> ]),
> False
> ),
> False
> )
> ])
> df = spark.createDataFrame([], schema=schema)
> df.select(F.array_sort(df['value'])).collect()
> df.select(F.sort_array(df['value'])).collect(){code}
>  
> {{array_sort()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
> "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
> The `<` does not support ordering on type "STRUCT<order: INT, values: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
>  
>  
> {{sort_array()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "sort_array(value, 
> true)" due to data type mismatch: The `sort_array` does not support ordering 
> on type "ARRAY>>". 
> SQLSTATE: 42K09 {code}
>  
>  
> I would expect the behavior to be different: this error should be raised if 
> the _first field_ is not orderable, otherwise it should succeed.
>  
> With the implementation as-is, I can't sort arrays of structs that contain 
> maps or other unorderable fields. The only workaround is painful: you'd have 
> to store the maps as a separate array, store the map array index on the 
> struct, and after sorting add the map to the struct using {{transform()}} and 
> {{{}element_at(){}}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) Improve array_sort documentation for default comparator

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Issue Type: Documentation  (was: Bug)

> Improve array_sort documentation for default comparator
> ---
>
> Key: SPARK-48275
> URL: https://issues.apache.org/jira/browse/SPARK-48275
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Running in Databricks on [Databricks Runtime 
> 15.1|https://docs.databricks.com/en/release-notes/runtime/15.1.html].
>Reporter: Matt Braymer-Hayes
>Priority: Critical
>
> When {{array_sort()}} and {{sort_array()}} are used on arrays of structs, the 
> first field is used to determine ordering. Unfortunately, it appears that the 
> current implementation requires _all_ fields to be orderable. Here's a 
> minimal example:
>  
> {code:java}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> schema = T.StructType([
> T.StructField(
> 'value',
> T.ArrayType(
> T.StructType([
> T.StructField('order', T.IntegerType(), True),
> T.StructField('values', T.MapType(T.StringType(), 
> T.StringType(), True), True), # remove this field and both commands below 
> succeed
> ]),
> False
> ),
> False
> )
> ])
> df = spark.createDataFrame([], schema=schema)
> df.select(F.array_sort(df['value'])).collect()
> df.select(F.sort_array(df['value'])).collect(){code}
>  
> {{array_sort()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
> "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
> The `<` does not support ordering on type "STRUCT<order: INT, values: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
>  
>  
> {{sort_array()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "sort_array(value, 
> true)" due to data type mismatch: The `sort_array` does not support ordering 
> on type "ARRAY>>". 
> SQLSTATE: 42K09 {code}
>  
>  
> I would expect the behavior to be different: this error should be raised if 
> the _first field_ is not orderable, otherwise it should succeed.
>  
> With the implementation as-is, I can't sort arrays of structs that contain 
> maps or other unorderable fields. The only workaround is painful: you'd have 
> to store the maps as a separate array, store the map array index on the 
> struct, and after sorting add the map to the struct using {{transform()}} and 
> {{{}element_at(){}}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Summary: array_sort: Improve documentation for default comparator's 
behavior for different types  (was: Improve array_sort documentation for 
default comparator)

> array_sort: Improve documentation for default comparator's behavior for 
> different types
> ---
>
> Key: SPARK-48275
> URL: https://issues.apache.org/jira/browse/SPARK-48275
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Running in Databricks on [Databricks Runtime 
> 15.1|https://docs.databricks.com/en/release-notes/runtime/15.1.html].
>Reporter: Matt Braymer-Hayes
>Priority: Trivial
>
> When {{array_sort()}} and {{sort_array()}} are used on arrays of structs, the 
> first field is used to determine ordering. Unfortunately, it appears that the 
> current implementation requires _all_ fields to be orderable. Here's a 
> minimal example:
>  
> {code:java}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> schema = T.StructType([
> T.StructField(
> 'value',
> T.ArrayType(
> T.StructType([
> T.StructField('order', T.IntegerType(), True),
> T.StructField('values', T.MapType(T.StringType(), 
> T.StringType(), True), True), # remove this field and both commands below 
> succeed
> ]),
> False
> ),
> False
> )
> ])
> df = spark.createDataFrame([], schema=schema)
> df.select(F.array_sort(df['value'])).collect()
> df.select(F.sort_array(df['value'])).collect(){code}
>  
> {{array_sort()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
> "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
> The `<` does not support ordering on type "STRUCT<order: INT, values: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
>  
>  
> {{sort_array()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "sort_array(value, 
> true)" due to data type mismatch: The `sort_array` does not support ordering 
> on type "ARRAY>>". 
> SQLSTATE: 42K09 {code}
>  
>  
> I would expect the behavior to be different: this error should be raised if 
> the _first field_ is not orderable, otherwise it should succeed.
>  
> With the implementation as-is, I can't sort arrays of structs that contain 
> maps or other unorderable fields. The only workaround is painful: you'd have 
> to store the maps as a separate array, store the map array index on the 
> struct, and after sorting add the map to the struct using {{transform()}} and 
> {{{}element_at(){}}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) Improve array_sort documentation for default comparator

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Priority: Trivial  (was: Critical)

> Improve array_sort documentation for default comparator
> ---
>
> Key: SPARK-48275
> URL: https://issues.apache.org/jira/browse/SPARK-48275
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Running in Databricks on [Databricks Runtime 
> 15.1|https://docs.databricks.com/en/release-notes/runtime/15.1.html].
>Reporter: Matt Braymer-Hayes
>Priority: Trivial
>
> When {{array_sort()}} and {{sort_array()}} are used on arrays of structs, the 
> first field is used to determine ordering. Unfortunately, it appears that the 
> current implementation requires _all_ fields to be orderable. Here's a 
> minimal example:
>  
> {code:java}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> schema = T.StructType([
> T.StructField(
> 'value',
> T.ArrayType(
> T.StructType([
> T.StructField('order', T.IntegerType(), True),
> T.StructField('values', T.MapType(T.StringType(), 
> T.StringType(), True), True), # remove this field and both commands below 
> succeed
> ]),
> False
> ),
> False
> )
> ])
> df = spark.createDataFrame([], schema=schema)
> df.select(F.array_sort(df['value'])).collect()
> df.select(F.sort_array(df['value'])).collect(){code}
>  
> {{array_sort()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
> "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
> The `<` does not support ordering on type "STRUCT<order: INT, values: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
>  
>  
> {{sort_array()}} output:
>  
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "sort_array(value, 
> true)" due to data type mismatch: The `sort_array` does not support ordering 
> on type "ARRAY>>". 
> SQLSTATE: 42K09 {code}
>  
>  
> I would expect the behavior to be different: this error should be raised if 
> the _first field_ is not orderable, otherwise it should succeed.
>  
> With the implementation as-is, I can't sort arrays of structs that contain 
> maps or other unorderable fields. The only workaround is painful: you'd have 
> to store the maps as a separate array, store the map array index on the 
> struct, and after sorting add the map to the struct using {{transform()}} and 
> {{{}element_at(){}}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48360) Simplify conditionals containing predicates

2024-05-20 Thread Thomas Powell (Jira)
Thomas Powell created SPARK-48360:
-

 Summary: Simplify conditionals containing predicates
 Key: SPARK-48360
 URL: https://issues.apache.org/jira/browse/SPARK-48360
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.3
Reporter: Thomas Powell


The Catalyst optimizer has many optimizations for {{CaseWhen}} and {{If}} 
expressions that can eliminate branches entirely or replace them with Boolean 
logic. There are additional "always false" conditionals that could be 
eliminated entirely. It would also be possible to replace conditionals with 
Boolean logic where the {{if-branch}} and {{else-branch}} are themselves 
predicates. The primary motivation would be to push more filters to the 
datasource.

For example:
{code:java}
Filter(If(GreaterThan(a, 2), false, LessThanOrEqual(b, 4))){code}
is equivalent to
{code:java}
// a not nullable
Filter(And(LessThanOrEqual(a, 2), LessThanOrEqual(b, 4)))

// a nullable
Filter(And(Not(EqualNullSafe(GreaterThan(a, 2), true)), LessThanOrEqual(b, 4)))
{code}
Within a filter the nullability handling is admittedly less important since the 
expression evaluating to null would be semantically equivalent to false, but 
the original conditional may have been intentionally written to not return null 
when {{a}} may be null.
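
A small PySpark illustration of the equivalence described above, with the rewrite 
done by hand since the proposal is for Catalyst to perform it automatically; the 
column names and data are arbitrary:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 3), (3, 3), (None, 2)], ["a", "b"])

# Conditional form: If(a > 2, false, b <= 4) used directly as a filter.
conditional = df.filter(
    F.when(F.col("a") > 2, F.lit(False)).otherwise(F.col("b") <= 4)
)

# Hand-written Boolean form for nullable `a`:
# NOT (a > 2 <=> true) AND b <= 4, using the null-safe equality operator.
boolean = df.filter(
    ~(F.col("a") > 2).eqNullSafe(F.lit(True)) & (F.col("b") <= 4)
)

conditional.show()
boolean.show()
{code}
Both filters keep the same rows; the Boolean form is the one that is easier to 
push down to a data source.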



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Description: 
h1. tl;dr

It would be helpful for the documentation for array_sort() to include the 
default comparator behavior for different array element types, especially 
structs. It would also be helpful for the 
{{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a custom 
comparator instead of the default comparator, especially when sorting on a 
complex type (e.g., a struct containing an unorderable field, like a map).

 

h1. Background

The default comparator for {{array_sort()}} for struct elements is to sort by 
every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., 
fieldN). This requires every field to be orderable: if any field is not, an error 
occurs.

 

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
T.StructField(
'value',
T.ArrayType(
T.StructType([
T.StructField('orderable', T.IntegerType(), True),
T.StructField('unorderable', T.MapType(T.StringType(), 
T.StringType(), True), True), # remove this field and both commands below 
succeed
]),
False
),
False
)
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect(){code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT<orderable: INT, unorderable: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
 

If the default comparator doesn't work for a user (e.g., they have an 
unorderable field like a map in their struct), array_sort() accepts a custom 
comparator, where users can order array elements however they like.

 

Building on the previous example:

 
{code:java}
import pyspark.sql as psql


def comparator(l: psql.Column, r: psql.Column) -> psql.Column:
"""Order structs l and r by order field.
Rules:
* Nulls are last
* In ascending order
"""
return (
F.when(l['order'].isNull() & r['order'].isNull(), 0)
.when(l['order'].isNull(), 1)
.when(r['order'].isNull(), -1)
.when(l['order'] < r['order'], -1)
.when(l['order'] == r['order'], 0)
.otherwise(1)
)

df.select(F.array_sort(df['value'], comparator)).collect(){code}
This works as intended.

 

h1. Ask

The documentation for array_sort() should include information on the behavior 
of the default comparator for various datatypes. For example, it would be 
helpful to know that the default comparator for structs compares all fields in 
schema order (i.e., {{{}ORDER BY field1, field2, ..., fieldN{}}}).

 

Additionally, when users passes an incomparable type 

 

  was:
When {{array_sort()}} and {{sort_array()}} are used on arrays of structs, the 
first field is used to determine ordering. Unfortunately, it appears that the 
current implementation requires _all_ fields to be orderable. Here's a minimal 
example:

 
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
T.StructField(
'value',
T.ArrayType(
T.StructType([
T.StructField('order', T.IntegerType(), True),
T.StructField('values', T.MapType(T.StringType(), 
T.StringType(), True), True), # remove this field and both commands below 
succeed
]),
False
),
False
)
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect()
df.select(F.sort_array(df['value'])).collect(){code}
 

{{array_sort()}} output:

 
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT<order: INT, values: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
 

 

{{sort_array()}} output:

 
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "sort_array(value, 
true)" due to data type mismatch: The `sort_array` does not support ordering on 
type "ARRAY>>". 
SQLSTATE: 42K09 {code}
 

 

I would expect the behavior to be different: this error should be raised if the 
_first field_ is not orderable, otherwise it should succeed.

 

With the implementation as-is, I can't sort arrays of structs that contain maps 
or other unorderable fields. The only workaround is painful: you'd have to 
store the maps as a separate array, store the map array index on the struct, 
and after sorting add the map to the struct using {{transform()}} and 
{{{}element_at(){}}}.

 


> array_sort: Improve documentation for default comparator's behavior for 
> different types
> ---

[jira] [Updated] (SPARK-48360) Simplify conditionals with predicate branches

2024-05-20 Thread Thomas Powell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Powell updated SPARK-48360:
--
Summary: Simplify conditionals with predicate branches  (was: Simplify 
conditionals containing predicates)

> Simplify conditionals with predicate branches
> -
>
> Key: SPARK-48360
> URL: https://issues.apache.org/jira/browse/SPARK-48360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Thomas Powell
>Priority: Major
>
> The Catalyst optimizer has many optimizations for {{CaseWhen}} and {{If}} 
> expressions that can eliminate branches entirely or replace them with 
> Boolean logic. There are additional "always false" conditionals that could be 
> eliminated entirely. It would also be possible to replace conditionals with 
> Boolean logic where the {{if-branch}} and {{else-branch}} are themselves 
> predicates. The primary motivation would be to push more filters to the 
> datasource.
> For example:
> {code:java}
> Filter(If(GreaterThan(a, 2), false, LessThanOrEqual(b, 4))){code}
> is equivalent to
> {code:java}
> // a not nullable
> Filter(And(LessThanOrEqual(a, 2), LessThanOrEqual(b, 4)))
> // a nullable
> Filter(And(Not(EqualNullSafe(GreaterThan(a, 2), true)), LessThanOrEqual(b, 4)))
> {code}
> Within a filter the nullability handling is admittedly less important since 
> the expression evaluating to null would be semantically equivalent to false, 
> but the original conditional may have been intentionally written to not 
> return null when {{a}} may be null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types

2024-05-20 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Description: 
h1. tl;dr

It would be helpful for the documentation for array_sort() to include the 
default comparator behavior for different array element types, especially 
structs. It would also be helpful for the 
{{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a custom 
comparator instead of the default comparator, especially when sorting on a 
complex type (e.g., a struct containing an unorderable field, like a map).

 

h1. Background

The default comparator for {{array_sort()}} for struct elements is to sort by 
every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., 
fieldN). This requires every field to be orderable: if any field is not, an error 
occurs.

 

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
T.StructField(
'value',
T.ArrayType(
T.StructType([
T.StructField('orderable', T.IntegerType(), True),
T.StructField('unorderable', T.MapType(T.StringType(), 
T.StringType(), True), True), # remove this field and both commands below 
succeed
]),
False
),
False
)
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect(){code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT<orderable: INT, unorderable: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
 

If the default comparator doesn't work for a user (e.g., they have an 
unorderable field like a map in their struct), array_sort() accepts a custom 
comparator, where users can order array elements however they like.

 

Building on the previous example:

 
{code:java}
import pyspark.sql as psql


def comparator(l: psql.Column, r: psql.Column) -> psql.Column:
"""Order structs l and r by order field.
Rules:
* Nulls are last
* In ascending order
"""
return (
F.when(l['order'].isNull() & r['order'].isNull(), 0)
.when(l['order'].isNull(), 1)
.when(r['order'].isNull(), -1)
.when(l['order'] < r['order'], -1)
.when(l['order'] == r['order'], 0)
.otherwise(1)
)

df.select(F.array_sort(df['value'], comparator)).collect(){code}
This works as intended.

 

h1. Ask

The documentation for {{array_sort()}} should include information on the 
behavior of the default comparator for various datatypes. For the 
array-of-unorderable-structs example, it would be helpful to know that the 
default comparator for structs compares all fields in schema order (i.e., 
{{{}ORDER BY field1, field2, ..., fieldN{}}}).

 

Additionally, when users pass an unorderable type to array_sort() and use 
the default comparator, the returned error should recommend using a 
custom comparator instead.

  was:
h1. tl;dr

It would be helpful for the documentation for array_sort() to include the 
default comparator behavior for different array element types, especially 
structs. It would also be helpful for the 
{{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE }}error to recommend using a custom 
comparator instead of the default comparator, especially when sorting on a 
complex type (e.g., a struct containing an unorderable field, like a map).

 

h1. Background

The default comparator for {{array_sort()}} for struct elements is to sort by 
every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., 
fieldN). This requires every field to be orderable: if they aren't an error 
occurs.

 

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
T.StructField(
'value',
T.ArrayType(
T.StructType([
T.StructField('orderable', T.IntegerType(), True),
T.StructField('unorderable', T.MapType(T.StringType(), 
T.StringType(), True), True), # remove this field and both commands below 
succeed
]),
False
),
False
)
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect(){code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT<orderable: INT, unorderable: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
 

If the default comparator doesn't work for a user (e.g., they have an 
unorderable field like a map in their struct), array_sort() accepts a custom 
comparator, where users can order array elements however they like.

 

Building on the previous example:

 
{code:java}
import pyspark.sql as psql


def com

[jira] [Commented] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types

2024-05-20 Thread Matt Braymer-Hayes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847902#comment-17847902
 ] 

Matt Braymer-Hayes commented on SPARK-48275:


Hi folks,

 

Original description was a user error on my part. Updated from a critical bug 
to a trivial doc fix.

> array_sort: Improve documentation for default comparator's behavior for 
> different types
> ---
>
> Key: SPARK-48275
> URL: https://issues.apache.org/jira/browse/SPARK-48275
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Running in Databricks on [Databricks Runtime 
> 15.1|https://docs.databricks.com/en/release-notes/runtime/15.1.html].
>Reporter: Matt Braymer-Hayes
>Priority: Trivial
>
> h1. tl;dr
> It would be helpful for the documentation for array_sort() to include the 
> default comparator behavior for different array element types, especially 
> structs. It would also be helpful for the 
> {{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a 
> custom comparator instead of the default comparator, especially when sorting 
> on a complex type (e.g., a struct containing an unorderable field, like a 
> map).
>  
> 
> h1. Background
> The default comparator for {{array_sort()}} for struct elements is to sort by 
> every field in the struct in schema order (i.e., ORDER BY field1, field2, 
> ..., fieldN). This requires every field to be orderable: if any field is not, an 
> error occurs.
>  
> Here's a small example:
> {code:java}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> schema = T.StructType([
> T.StructField(
> 'value',
> T.ArrayType(
> T.StructType([
> T.StructField('orderable', T.IntegerType(), True),
> T.StructField('unorderable', T.MapType(T.StringType(), 
> T.StringType(), True), True), # remove this field and both commands below 
> succeed
> ]),
> False
> ),
> False
> )
> ])
> df = spark.createDataFrame([], schema=schema)
> df.select(F.array_sort(df['value'])).collect(){code}
> Output:
> {code:java}
> [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
> "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
> The `<` does not support ordering on type "STRUCT<orderable: INT, unorderable: MAP<STRING, STRING>>". SQLSTATE: 42K09 {code}
>  
> If the default comparator doesn't work for a user (e.g., they have an 
> unorderable field like a map in their struct), array_sort() accepts a custom 
> comparator, where users can order array elements however they like.
>  
> Building on the previous example:
>  
> {code:java}
> import pyspark.sql as psql
> def comparator(l: psql.Column, r: psql.Column) -> psql.Column:
> """Order structs l and r by order field.
> Rules:
> * Nulls are last
> * In ascending order
> """
> return (
> F.when(l['order'].isNull() & r['order'].isNull(), 0)
> .when(l['order'].isNull(), 1)
> .when(r['order'].isNull(), -1)
> .when(l['order'] < r['order'], -1)
> .when(l['order'] == r['order'], 0)
> .otherwise(1)
> )
> df.select(F.array_sort(df['value'], comparator)).collect(){code}
> This works as intended.
>  
> 
> h1. Ask
> The documentation for {{array_sort()}} should include information on the 
> behavior of the default comparator for various datatypes. For the 
> array-of-unorderable-structs example, it would be helpful to know that the 
> default comparator for structs compares all fields in schema order (i.e., 
> {{{}ORDER BY field1, field2, ..., fieldN{}}}).
>  
> Additionally, when users pass an unorderable type to array_sort() and use 
> the default comparator, the returned error should recommend using a 
> custom comparator instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48360) Simplify conditionals with predicate branches

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48360:
---
Labels: pull-request-available  (was: )

> Simplify conditionals with predicate branches
> -
>
> Key: SPARK-48360
> URL: https://issues.apache.org/jira/browse/SPARK-48360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Thomas Powell
>Priority: Major
>  Labels: pull-request-available
>
> The Catalyst optimizer has many optimizations for {{CaseWhen}} and {{If}} 
> expressions that can eliminate branches entirely or replace them with 
> Boolean logic. There are additional "always false" conditionals that could be 
> eliminated entirely. It would also be possible to replace conditionals with 
> Boolean logic where the {{if-branch}} and {{else-branch}} are themselves 
> predicates. The primary motivation would be to push more filters to the 
> datasource.
> For example:
> {code:java}
> Filter(If(GreaterThan(a, 2), false, LessThanOrEqual(b, 4))){code}
> is equivalent to
> {code:java}
> // a not nullable
> Filter(And(LessThanOrEqual(a, 2), LessThanOrEqual(b, 4)))
> // a nullable
> Filter(And(Not(EqualNullSafe(GreaterThan(a, 2), true)), LessThanOrEqual(b, 4)))
> {code}
> Within a filter the nullability handling is admittedly less important since 
> the expression evaluating to null would be semantically equivalent to false, 
> but the original conditional may have been intentionally written to not 
> return null when {{a}} may be null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48342) Parser support

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48342:
---
Labels: pull-request-available  (was: )

> Parser support
> --
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Implement parser for SQL scripting with all supporting changes for upcoming 
> interpreter implementation and future extensions of the parser:
>  * Parser
>  * Parser testing
>  * Support for SQL scripting exceptions.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48359) Built-in functions for Zstd compression and decompression

2024-05-20 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-48359:
---
Description: 
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * streaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.

  was:
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, steaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * steaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.


> Built-in functions for Zstd compression and decompression
> -
>
> Key: SPARK-48359
> URL: https://issues.apache.org/jira/browse/SPARK-48359
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>
> Some users are using UDFs for Zstd compression and decompression, which 
> results in poor performance. If we provide native functions, the performance 
> will be improved by compressing and decompressing just within the JVM.
>  
> Now, we are introducing three new built-in functions:
> {code:java}
> zstd_compress(input: binary [, level: int [, streaming_mode: bool]])
> zstd_decompress(input: binary)
> try_zstd_decompress(input: binary)
> {code}
> where
>  * input: The binary value to compress or decompress.
>  * level: Optional integer argument that represents the compression level. 
> The compression level controls the trade-off between compression speed and 
> compression ratio. The default level is 3. Valid values: between 1 and 22 
> inclusive
>  * streaming_mode: Optional boolean argument that represents whether to use 
> streaming mode to compress. 
> Examples:
> {code:sql}
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
>   KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
>   KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT 
> > string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
>   Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache 
> Spark Apache Spark Apache Spark Apache Spark Apache Spark
> > SELE

[jira] [Updated] (SPARK-48359) Built-in functions for Zstd compression and decompression

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48359:
---
Labels: pull-request-available  (was: )

> Built-in functions for Zstd compression and decompression
> -
>
> Key: SPARK-48359
> URL: https://issues.apache.org/jira/browse/SPARK-48359
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
>
> Some users are using UDFs for Zstd compression and decompression, which 
> results in poor performance. If we provide native functions, the performance 
> will be improved by compressing and decompressing just within the JVM.
>  
> Now, we are introducing three new built-in functions:
> {code:java}
> zstd_compress(input: binary [, level: int [, streaming_mode: bool]])
> zstd_decompress(input: binary)
> try_zstd_decompress(input: binary)
> {code}
> where
>  * input: The binary value to compress or decompress.
>  * level: Optional integer argument that represents the compression level. 
> The compression level controls the trade-off between compression speed and 
> compression ratio. The default level is 3. Valid values: between 1 and 22 
> inclusive
>  * streaming_mode: Optional boolean argument that represents whether to use 
> streaming mode to compress. 
> Examples:
> {code:sql}
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
>   KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
>   KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT 
> > string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
>   Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache 
> Spark Apache Spark Apache Spark Apache Spark Apache Spark
> > SELECT zstd_decompress(zstd_compress("Apache Spark"));
>   Apache Spark
> > SELECT try_zstd_decompress("invalid input")
>   NULL
> {code}
> These three built-in functions are also available in Python and Scala.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-20 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks updated SPARK-48361:
--
Description: 
Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*

 
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*

 
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+-------+-------+-------+--------+---------------+-----------+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+-------+-------+-------+--------+---------------+-----------+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+-------+-------+-------+--------+---------------+-----------+ {code}
So far so good...

 

*BUT*

 

*If we add an aggregate before we show:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
  
// sum up column2 per column1 group
val groupedSum = 
dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))

groupedSum.show(){code}
*We get:*
{code:java}
+-------+-----------+
|column1|sum_column2|
+-------+-----------+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+-------+-----------+ {code}
*Which is not correct*

 

With the addition of the aggregate, the filter down to rows with 3 commas in the corrupt record column is ignored. This does not happen with any other operators I have tried, only aggregates so far.
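A possible mitigation, sketched below against the repro above and not verified against this specific bug: materialize the parsed DataFrame before aggregating, which is the pattern generally suggested for queries that depend on _corrupt_record.
{code:scala}
// Hedged mitigation sketch (builds on dfWithJagged from the repro above; untested here):
// cache and force evaluation so _corrupt_record and __is_jagged are computed before
// the aggregate's column pruning can interfere.
val materialized = dfWithJagged.cache()
materialized.count() // force evaluation while _corrupt_record is still referenced

val groupedSumCached = materialized
  .groupBy("column1")
  .agg(sum("column2").alias("sum_column2"))

groupedSumCached.show()
{code}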

 

 

 

  was:
Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*

 
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*

 
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithC

[jira] [Created] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-20 Thread Ted Chester Jenks (Jira)
Ted Chester Jenks created SPARK-48361:
-

 Summary: Correctness: CSV corrupt record filter with aggregate 
ignored
 Key: SPARK-48361
 URL: https://issues.apache.org/jira/browse/SPARK-48361
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1
 Environment: Using spark shell 3.5.1 on M1 Mac
Reporter: Ted Chester Jenks


Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*

 
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*

 
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+-------+-------+-------+--------+---------------+-----------+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+-------+-------+-------+--------+---------------+-----------+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+-------+-------+-------+--------+---------------+-----------+ {code}
So far so good...

 

*BUT*

 

*If we add an aggregate before we show:*

*In spark-shell:*

 
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
  
// sum up column2 per column1 group
val groupedSum = 
dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))

groupedSum.show(){code}
*We get:*
{code:java}
+-------+-----------+
|column1|sum_column2|
+-------+-----------+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+-------+-----------+ {code}
*Which is not correct*

 

With the addition of the aggregate, the filter down to rows with 3 commas in the corrupt record column is ignored. This does not happen with any other operators I have tried, only aggregates so far.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-20 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks updated SPARK-48361:
--
Description: 
Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+-------+-------+-------+--------+---------------+-----------+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+-------+-------+-------+--------+---------------+-----------+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+-------+-------+-------+--------+---------------+-----------+ {code}
So far so good...

 

*BUT*

 

*If we add an aggregate before we show:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
  
// sum up column2 per column1 group
val groupedSum = 
dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))

groupedSum.show(){code}
*We get:*
{code:java}
+-------+-----------+
|column1|sum_column2|
+-------+-----------+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+-------+-----------+ {code}
 

*Which is not correct*

 

With the addition of the aggregate, the filter down to rows with 3 commas in the corrupt record column is ignored. This does not happen with any other operators I have tried, only aggregates so far.

 

 

 

  was:
Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*

 
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*

 
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorr

[jira] [Resolved] (SPARK-48017) Add Spark application submission worker for operator

2024-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48017.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 10
[https://github.com/apache/spark-kubernetes-operator/pull/10]

> Add Spark application submission worker for operator
> 
>
> Key: SPARK-48017
> URL: https://issues.apache.org/jira/browse/SPARK-48017
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark Operator needs a submission worker that converts its application 
> abstraction (Operator API) to k8s resources. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48017) Add Spark application submission worker for operator

2024-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48017:
-

Assignee: Zhou JIANG

> Add Spark application submission worker for operator
> 
>
> Key: SPARK-48017
> URL: https://issues.apache.org/jira/browse/SPARK-48017
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>
> Spark Operator needs a submission worker that converts its application 
> abstraction (Operator API) to k8s resources. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48328) Upgrade `Arrow` to 16.1.0

2024-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48328.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46646
[https://github.com/apache/spark/pull/46646]

> Upgrade `Arrow` to 16.1.0
> -
>
> Key: SPARK-48328
> URL: https://issues.apache.org/jira/browse/SPARK-48328
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-20 Thread Szehon Ho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847962#comment-17847962
 ] 

Szehon Ho commented on SPARK-48329:
---

Oh sorry I just saw this, I was about to make this pr but was waiting on 
internal review.  I think yours works too, but I also made some simplification 
in the test suites.  Maybe we can be co-author here.

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases.  We should take advantage of 4.0 
> release and change the value to true.
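Until the default changes, the flag can be enabled explicitly per session; a minimal Scala sketch (configuration names taken from the ticket and the existing SPJ work, everything else illustrative):
{code:scala}
// Minimal sketch: turning on SPJ partition-value pushdown explicitly today.
// This ticket proposes making the second flag default to true in 4.0.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spj-flag-sketch").getOrCreate()

// Base flag for storage-partitioned joins (assumed prerequisite here).
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
// Flag discussed in this ticket.
spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
{code}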



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-20 Thread Szehon Ho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847962#comment-17847962
 ] 

Szehon Ho edited comment on SPARK-48329 at 5/20/24 6:58 PM:


Oh sorry I just saw this and also made a pr, I was going to make this pr 
earlier but was waiting on internal review.  I think yours works too, but I 
also made some simplification in the test suites.  Maybe we can be co-author 
here.


was (Author: szehon):
Oh sorry I just saw this, I was about to make this pr but was waiting on 
internal review.  I think yours works too, but I also made some simplification 
in the test suites.  Maybe we can be co-author here.

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases.  We should take advantage of 4.0 
> release and change the value to true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-20 Thread Szehon Ho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847963#comment-17847963
 ] 

Szehon Ho commented on SPARK-48329:
---

I cherry picked your doc change to my pr to be co-author, is it ok?

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases.  We should take advantage of 4.0 
> release and change the value to true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48359) Built-in functions for Zstd compression and decompression

2024-05-20 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-48359:
---
Description: 
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * streaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.

  was:
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * streaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.


> Built-in functions for Zstd compression and decompression
> -
>
> Key: SPARK-48359
> URL: https://issues.apache.org/jira/browse/SPARK-48359
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
>
> Some users are using UDFs for Zstd compression and decompression, which 
> results in poor performance. If we provide native functions, the performance 
> will be improved by compressing and decompressing just within the JVM.
>  
> Now, we are introducing three new built-in functions:
> {code:java}
> zstd_compress(input: binary [, level: int [, streaming_mode: bool]])
> zstd_decompress(input: binary)
> try_zstd_decompress(input: binary)
> {code}
> where
>  * input: The binary value to compress or decompress.
>  * level: Optional integer argument that represents the compression level. 
> The compression level controls the trade-off between compression speed and 
> compression ratio. The default level is 3. Valid values: between 1 and 22 
> inclusive
>  * streaming_mode: Optional boolean argument that represents whether to use 
> streaming mode to compress. 
> Examples:
> {code:sql}
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
>   KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
>   KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> > SELECT 
> > string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
>   Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache 
> Spark Apach

[jira] [Updated] (SPARK-48359) Built-in functions for Zstd compression and decompression

2024-05-20 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-48359:
---
Description: 
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * `input`: The binary value to compress or decompress.
 * `level`: Optional integer argument that represents the compression level. 
The compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * `streaming_mode`: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.

  was:
Some users are using UDFs for Zstd compression and decompression, which results 
in poor performance. If we provide native functions, the performance will be 
improved by compressing and decompressing just within the JVM.

 

Now, we are introducing three new built-in functions:
{code:java}
zstd_compress(input: binary [, level: int [, streaming_mode: bool]])

zstd_decompress(input: binary)

try_zstd_decompress(input: binary)
{code}
where
 * input: The binary value to compress or decompress.
 * level: Optional integer argument that represents the compression level. The 
compression level controls the trade-off between compression speed and 
compression ratio. The default level is 3. Valid values: between 1 and 22 
inclusive
 * streaming_mode: Optional boolean argument that represents whether to use 
streaming mode to compress. 

Examples:
{code:sql}
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
  KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
  KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> SELECT 
> string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
  Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark 
Apache Spark Apache Spark Apache Spark Apache Spark
> SELECT zstd_decompress(zstd_compress("Apache Spark"));
  Apache Spark
> SELECT try_zstd_decompress("invalid input")
  NULL
{code}
These three built-in functions are also available in Python and Scala.


> Built-in functions for Zstd compression and decompression
> -
>
> Key: SPARK-48359
> URL: https://issues.apache.org/jira/browse/SPARK-48359
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
>
> Some users are using UDFs for Zstd compression and decompression, which 
> results in poor performance. If we provide native functions, the performance 
> will be improved by compressing and decompressing just within the JVM.
>  
> Now, we are introducing three new built-in functions:
> {code:java}
> zstd_compress(input: binary [, level: int [, streaming_mode: bool]])
> zstd_decompress(input: binary)
> try_zstd_decompress(input: binary)
> {code}
> where
>  * `input`: The binary value to compress or decompress.
>  * `level`: Optional integer argument that represents the compression level. 
> The compression level controls the trade-off between compression speed and 
> compression ratio. The default level is 3. Valid values: between 1 and 22 
> inclusive
>  * `streaming_mode`: Optional boolean argument that represents whether to use 
> streaming mode to compress. 
> Examples:
> {code:sql}
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10)));
>   KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=
> > SELECT base64(zstd_compress(repeat("Apache Spark ", 10), 3, true));
>   KLUv/QBYpAAAaEFwYWNoZSBTcGFyayABABLS+QUBAAA=
> > SELECT 
> > string(zstd_decompress(unbase64("KLUv/SCCpQAAaEFwYWNoZSBTcGFyayABABLS+QU=")));
>   Apache Spark Apache Spark Apache Spark Apache Spark Apache Spark Apach

[jira] [Created] (SPARK-48362) Add CollectSetWIthLimit

2024-05-20 Thread Holden Karau (Jira)
Holden Karau created SPARK-48362:


 Summary: Add CollectSetWIthLimit
 Key: SPARK-48362
 URL: https://issues.apache.org/jira/browse/SPARK-48362
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Holden Karau


See 
[https://stackoverflow.com/questions/38730912/how-to-limit-functions-collect-set-in-spark-sql]

 

Some users want to collect a set, but if the number of distinct elements is too large they may hit a "Cannot grow BufferHolder" error from collecting the full set and then trimming it.

 

We should offer a collect_set variant that preemptively stops adding elements beyond the requested limit, reducing the amount of memory used.
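For context, the pattern from the linked Stack Overflow thread trims the set only after it has been fully collected, which is exactly what can blow up the buffer for high-cardinality groups; a short Scala sketch of that existing workaround (the DataFrame and column names are illustrative):
{code:scala}
// Sketch of today's workaround (illustrative data): collect the full set, then trim it
// with slice(). The full set is still materialized first, which is what can trigger the
// "Cannot grow BufferHolder" error -- hence the request for a limit-aware collect_set.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, slice}

val spark = SparkSession.builder().appName("collect-set-limit-sketch").getOrCreate()
import spark.implicits._

val events = Seq(("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a")).toDF("user", "item")

val limited = events
  .groupBy("user")
  .agg(slice(collect_set($"item"), 1, 2).as("items_sample")) // keep at most 2 items per user

limited.show()
{code}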



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48351) JDBC Connectors - Add cast suite

2024-05-20 Thread Uros Stankovic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uros Stankovic updated SPARK-48351:
---

Won't do now

> JDBC Connectors - Add cast suite
> 
>
> Key: SPARK-48351
> URL: https://issues.apache.org/jira/browse/SPARK-48351
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-20 Thread Szehon Ho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847963#comment-17847963
 ] 

Szehon Ho edited comment on SPARK-48329 at 5/20/24 10:37 PM:
-

I cherry picked your doc change to my pr.  I think it should work to also 
add you as co-author, is it ok?


was (Author: szehon):
I cherry picked your doc change to my pr to be co-author, is it ok?

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases.  We should take advantage of 4.0 
> release and change the value to true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval

2024-05-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48330:


Assignee: Chaoqin Li

> Fix the python streaming data source timeout issue for large trigger interval
> -
>
> Key: SPARK-48330
> URL: https://issues.apache.org/jira/browse/SPARK-48330
> Project: Spark
>  Issue Type: Task
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Currently we run a long-running Python worker process for the Python streaming 
> source and sink to perform planning, commit, and abort on the driver side. Testing 
> indicates that the current implementation causes connection timeout errors when a 
> streaming query has a large trigger interval.
> For the Python streaming source, we keep the long-running worker architecture but 
> set the socket timeout to infinity to avoid the timeout error.
> For the Python streaming sink, since StreamingWrite is also created per 
> microbatch on the Scala side, a long-running worker cannot be attached to a 
> StreamingWrite instance. Therefore we abandon the long-running worker 
> architecture, simply call commit() or abort() and exit the worker, and allow 
> Spark to reuse workers for us.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval

2024-05-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48330.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46651
[https://github.com/apache/spark/pull/46651]

> Fix the python streaming data source timeout issue for large trigger interval
> -
>
> Key: SPARK-48330
> URL: https://issues.apache.org/jira/browse/SPARK-48330
> Project: Spark
>  Issue Type: Task
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently we run a long-running Python worker process for the Python streaming 
> source and sink to perform planning, commit, and abort on the driver side. Testing 
> indicates that the current implementation causes connection timeout errors when a 
> streaming query has a large trigger interval.
> For the Python streaming source, we keep the long-running worker architecture but 
> set the socket timeout to infinity to avoid the timeout error.
> For the Python streaming sink, since StreamingWrite is also created per 
> microbatch on the Scala side, a long-running worker cannot be attached to a 
> StreamingWrite instance. Therefore we abandon the long-running worker 
> architecture, simply call commit() or abort() and exit the worker, and allow 
> Spark to reuse workers for us.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48363) Cleanup some redundant codes in `from_xml`

2024-05-20 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48363:
---

 Summary: Cleanup some redundant codes in `from_xml`
 Key: SPARK-48363
 URL: https://issues.apache.org/jira/browse/SPARK-48363
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48363) Cleanup some redundant codes in `from_xml`

2024-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48363:
---
Labels: pull-request-available  (was: )

> Cleanup some redundant codes in `from_xml`
> --
>
> Key: SPARK-48363
> URL: https://issues.apache.org/jira/browse/SPARK-48363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48258) Implement DataFrame.checkpoint and DataFrame.localCheckpoint

2024-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48258.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46570
[https://github.com/apache/spark/pull/46570]

> Implement DataFrame.checkpoint and DataFrame.localCheckpoint
> 
>
> Key: SPARK-48258
> URL: https://issues.apache.org/jira/browse/SPARK-48258
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We should add DataFrame.checkpoint and DataFrame.localCheckpoint for feature 
> parity.
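For reference, a minimal Scala sketch of the existing eager checkpoint APIs that this ticket mirrors on the Python/Connect side (the checkpoint directory path is illustrative):
{code:scala}
// Reference sketch of the classic Scala APIs this ticket brings to Python/Connect.
// localCheckpoint() truncates lineage using executor-local storage; checkpoint()
// additionally requires a checkpoint directory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()

val df = spark.range(0, 1000).toDF("id")

val local = df.localCheckpoint() // eager by default, no checkpoint dir needed

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path
val reliable = df.checkpoint()   // eager by default, writes to the checkpoint dir
{code}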



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-20 Thread chesterxu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848011#comment-17848011
 ] 

chesterxu commented on SPARK-48329:
---

ok, co-author is acceptable and the rest of the PR belongs to you~

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases.  We should take advantage of 4.0 
> release and change the value to true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48340:


Assignee: angerszhu

> Support TimestampNTZ  infer schema miss prefer_timestamp_ntz
> 
>
> Key: SPARK-48340
> URL: https://issues.apache.org/jira/browse/SPARK-48340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-20-18-38-39-769.png
>
>
> !image-2024-05-20-18-38-39-769.png|width=746,height=450!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48340) Support TimestampNTZ infer schema miss prefer_timestamp_ntz

2024-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48340.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 4
[https://github.com/apache/spark/pull/4]

> Support TimestampNTZ  infer schema miss prefer_timestamp_ntz
> 
>
> Key: SPARK-48340
> URL: https://issues.apache.org/jira/browse/SPARK-48340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2024-05-20-18-38-39-769.png
>
>
> !image-2024-05-20-18-38-39-769.png|width=746,height=450!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48363) Cleanup some redundant codes in `from_xml`

2024-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48363:


Assignee: BingKun Pan

> Cleanup some redundant codes in `from_xml`
> --
>
> Key: SPARK-48363
> URL: https://issues.apache.org/jira/browse/SPARK-48363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


