[jira] [Updated] (SPARK-48499) Use Math.abs to get positive numbers

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48499:
---
Labels: pull-request-available  (was: )

> Use Math.abs to get positive numbers
> 
>
> Key: SPARK-48499
> URL: https://issues.apache.org/jira/browse/SPARK-48499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Junqing Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48499) Use Math.abs to get positive numbers

2024-05-31 Thread Junqing Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junqing Li updated SPARK-48499:
---
Summary: Use Math.abs to get positive numbers  (was: Use Math.abs to get 
Unsigned Numbers)

> Use Math.abs to get positive numbers
> 
>
> Key: SPARK-48499
> URL: https://issues.apache.org/jira/browse/SPARK-48499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Junqing Li
>Priority: Major
>







[jira] [Created] (SPARK-48499) Use Math.abs to get Unsigned Numbers

2024-05-31 Thread Junqing Li (Jira)
Junqing Li created SPARK-48499:
--

 Summary: Use Math.abs to get Unsigned Numbers
 Key: SPARK-48499
 URL: https://issues.apache.org/jira/browse/SPARK-48499
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Junqing Li









[jira] [Updated] (SPARK-48497) Add user guide for batch data source write API

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48497:
---
Labels: pull-request-available  (was: )

> Add user guide for batch data source write API
> --
>
> Key: SPARK-48497
> URL: https://issues.apache.org/jira/browse/SPARK-48497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Add examples for batch data source write.






[jira] [Created] (SPARK-48497) Add user guide for batch data source write API

2024-05-31 Thread Allison Wang (Jira)
Allison Wang created SPARK-48497:


 Summary: Add user guide for batch data source write API
 Key: SPARK-48497
 URL: https://issues.apache.org/jira/browse/SPARK-48497
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add examples for batch data source write.






[jira] [Assigned] (SPARK-48490) Unescapes any literals for message of MessageWithContext

2024-05-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-48490:
--

Assignee: BingKun Pan

> Unescapes any literals for message of MessageWithContext
> 
>
> Key: SPARK-48490
> URL: https://issues.apache.org/jira/browse/SPARK-48490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48490) Unescapes any literals for message of MessageWithContext

2024-05-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48490.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46824
[https://github.com/apache/spark/pull/46824]

> Unescapes any literals for message of MessageWithContext
> 
>
> Key: SPARK-48490
> URL: https://issues.apache.org/jira/browse/SPARK-48490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48391:
---

Assignee: jiahong.li

> use addAll instead of add function  in TaskMetrics  to accelerate
> -
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: jiahong.li
>Assignee: jiahong.li
>Priority: Major
>  Labels: pull-request-available
>
> In the fromAccumulators method of TaskMetrics, we should use 
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
> _externalAccums is an instance of CopyOnWriteArrayList.
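For illustration, a minimal standalone sketch (not Spark's actual TaskMetrics code) of why this helps: each add() on a CopyOnWriteArrayList copies the entire backing array, so appending n accumulators one by one does O(n^2) copying, while a single addAll() copies the array only once.

{code:scala}
import java.util.concurrent.CopyOnWriteArrayList
import scala.jdk.CollectionConverters._

object AddAllVsAdd {
  def main(args: Array[String]): Unit = {
    val accums = (1 to 20000).map(i => s"acc-$i")

    // One add() per element: every call snapshots (copies) the whole backing array.
    val perElement = new CopyOnWriteArrayList[String]()
    val t0 = System.nanoTime()
    accums.foreach(perElement.add)
    val addMs = (System.nanoTime() - t0) / 1e6

    // A single addAll(): the backing array is copied once for the whole batch.
    val bulk = new CopyOnWriteArrayList[String]()
    val t1 = System.nanoTime()
    bulk.addAll(accums.asJava)
    val addAllMs = (System.nanoTime() - t1) / 1e6

    println(f"add() loop: $addMs%.1f ms, addAll(): $addAllMs%.1f ms")
  }
}
{code}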






[jira] [Resolved] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48391.
-
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46705
[https://github.com/apache/spark/pull/46705]

> use addAll instead of add function  in TaskMetrics  to accelerate
> -
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: jiahong.li
>Assignee: jiahong.li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> In the fromAccumulators method of TaskMetrics, we should use 
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
> _externalAccums is an instance of CopyOnWriteArrayList.






[jira] [Updated] (SPARK-48466) Wrap empty relation propagation in a dedicated node

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48466:
---
Labels: pull-request-available  (was: )

> Wrap empty relation propagation in a dedicated node
> ---
>
> Key: SPARK-48466
> URL: https://issues.apache.org/jira/browse/SPARK-48466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ziqi Liu
>Priority: Major
>  Labels: pull-request-available
>
> Currently we replace the plan with a LocalTableScan in case of empty relation 
> propagation, which loses the information about the original query plan and 
> makes it less human-readable. The idea is to create a dedicated 
> `EmptyRelation` node which is a leaf node but wraps the original query plan 
> inside.
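As a rough illustration of the proposal, a hypothetical Catalyst-style sketch (the node actually introduced by this ticket may have a different name or shape): a leaf node that behaves like an empty relation for the optimizer but keeps the replaced plan around so EXPLAIN output stays readable.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}

// Hypothetical node: looks empty to the optimizer, but wraps the plan it replaced.
case class EmptyRelation(original: LogicalPlan) extends LeafNode {
  // Keep the original output attributes so parent operators remain resolved.
  override def output: Seq[Attribute] = original.output

  // Advertise zero rows, which is what empty-relation propagation relies on.
  override def computeStats(): Statistics =
    Statistics(sizeInBytes = 0, rowCount = Some(0))

  // Surface the wrapped plan's name in explain output for readability.
  override def simpleString(maxFields: Int): String =
    s"EmptyRelation (was: ${original.nodeName})"
}
{code}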






[jira] [Updated] (SPARK-48491) Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48491:
---
Labels: pull-request-available  (was: )

> Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll
> ---
>
> Key: SPARK-48491
> URL: https://issues.apache.org/jira/browse/SPARK-48491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48495) Document planned approach to shredding

2024-05-31 Thread David Cashman (Jira)
David Cashman created SPARK-48495:
-

 Summary: Document planned approach to shredding
 Key: SPARK-48495
 URL: https://issues.apache.org/jira/browse/SPARK-48495
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: David Cashman









[jira] [Resolved] (SPARK-48465) Avoid no-op empty relation propagation in AQE

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48465.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46814
[https://github.com/apache/spark/pull/46814]

> Avoid no-op empty relation propagation in AQE
> -
>
> Key: SPARK-48465
> URL: https://issues.apache.org/jira/browse/SPARK-48465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ziqi Liu
>Assignee: Ziqi Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We should avoid no-op empty relation propagation in AQE: if we convert an 
> empty QueryStageExec to an empty relation, it will be further wrapped into a 
> new query stage and executed -> produce an empty result -> trigger empty 
> relation propagation again. This issue is currently not exposed because AQE 
> will try to reuse the shuffle.






[jira] [Updated] (SPARK-48494) Update airlift:aircompressor to 0.27

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48494:
---
Labels: pull-request-available  (was: )

> Update airlift:aircompressor to 0.27
> 
>
> Key: SPARK-48494
> URL: https://issues.apache.org/jira/browse/SPARK-48494
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> [CVE-2024-36114|https://www.cve.org/CVERecord?id=CVE-2024-36114]






[jira] [Updated] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance

2024-05-31 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-48493:

Description: 
This enhancement adds an option to the Python Datasource Reader to yield Arrow 
batches directly, significantly boosting performance compared to using tuples 
or Rows. This implementation leverages the existing work with MapInArrow (see 
SPARK-46253 ).

Tests with a custom Python Datasource for High Energy Physics (HEP) data using 
the ROOT format reader showed an 8x speed increase when using Arrow batches 
over the traditional method of feeding data via tuples.

Additional context:
 * The ROOT data format is widely used in High Energy Physics (HEP) with 
approximately 1 exabyte of ROOT data currently in existence.
 * You can easily read ROOT data using libraries from the Python ecosystem, 
notably {{uproot}} and {{{}awkward-array{}}}. These libraries facilitate the 
reading of ROOT data and its conversion to Arrow among other formats.
 * You can write a simple ROOT data source using the Python datasource API. 
While this may not be optimal for performance, it is easy to implement and can 
leverage the mentioned libraries.
 * For better performance, ingest data via Arrow batches rather than row by 
row, as the latter method is significantly slower (an initial test showed it to 
be 8 times slower).
 * Arrow is very popular now, and this enhancement can benefit other 
communities beyond HEP that use Arrow for efficient data processing.

This enhancement will provide substantial performance improvements and make it 
easier to work with HEP data and other data types using Apache Spark.

  was:
 
This proposes an enhancement to the Python Datasource Reader by adding an 
option to yield Arrow batches directly, significantly boosting performance 
compared to using tuples or Rows. This implementation uses the existing work 
with MapInArrow (see SPARK-46253 ).

In tests with a custom Python Datasource for High Energy Physics data (ROOT 
format reader), using Arrow batches has demonstrated an 8x speed increase over 
the traditional method of feeding data via tuples.


> Enhance Python Datasource Reader with Arrow Batch Support for Improved 
> Performance
> --
>
> Key: SPARK-48493
> URL: https://issues.apache.org/jira/browse/SPARK-48493
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Luca Canali
>Priority: Minor
>  Labels: pull-request-available
>
> This enhancement adds an option to the Python Datasource Reader to yield 
> Arrow batches directly, significantly boosting performance compared to using 
> tuples or Rows. This implementation leverages the existing work with 
> MapInArrow (see SPARK-46253 ).
> Tests with a custom Python Datasource for High Energy Physics (HEP) data 
> using the ROOT format reader showed an 8x speed increase when using Arrow 
> batches over the traditional method of feeding data via tuples.
> Additional context:
>  * The ROOT data format is widely used in High Energy Physics (HEP) with 
> approximately 1 exabyte of ROOT data currently in existence.
>  * You can easily read ROOT data using libraries from the Python ecosystem, 
> notably {{uproot}} and {{{}awkward-array{}}}. These libraries facilitate the 
> reading of ROOT data and its conversion to Arrow among other formats.
>  * You can write a simple ROOT data source using the Python datasource API. 
> While this may not be optimal for performance, it is easy to implement and 
> can leverage the mentioned libraries.
>  * For better performance, ingest data via Arrow batches rather than row by 
> row, as the latter method is significantly slower (an initial test showed it 
> to be 8 times slower).
>  * Arrow is very popular now, and this enhancement can benefit other 
> communities beyond HEP that use Arrow for efficient data processing.
> This enhancement will provide substantial performance improvements and make 
> it easier to work with HEP data and other data types using Apache Spark.






[jira] [Updated] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48493:
---
Labels: pull-request-available  (was: )

> Enhance Python Datasource Reader with Arrow Batch Support for Improved 
> Performance
> --
>
> Key: SPARK-48493
> URL: https://issues.apache.org/jira/browse/SPARK-48493
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Luca Canali
>Priority: Minor
>  Labels: pull-request-available
>
>  
> This proposes an enhancement to the Python Datasource Reader by adding an 
> option to yield Arrow batches directly, significantly boosting performance 
> compared to using tuples or Rows. This implementation uses the existing work 
> with MapInArrow (see SPARK-46253 ).
> In tests with a custom Python Datasource for High Energy Physics data (ROOT 
> format reader), using Arrow batches has demonstrated an 8x speed increase 
> over the traditional method of feeding data via tuples.






[jira] [Created] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance

2024-05-31 Thread Luca Canali (Jira)
Luca Canali created SPARK-48493:
---

 Summary: Enhance Python Datasource Reader with Arrow Batch Support 
for Improved Performance
 Key: SPARK-48493
 URL: https://issues.apache.org/jira/browse/SPARK-48493
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Luca Canali


 
This proposes an enhancement to the Python Datasource Reader by adding an 
option to yield Arrow batches directly, significantly boosting performance 
compared to using tuples or Rows. This implementation uses the existing work 
with MapInArrow (see SPARK-46253 ).

In tests with a custom Python Datasource for High Energy Physics data (ROOT 
format reader), using Arrow batches has demonstrated an 8x speed increase over 
the traditional method of feeding data via tuples.






[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job (read)")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    .load()
)

print("Streaming dataframe from the streaming job (readStream)")
df_stream_3.printSchema(){code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job (read)
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Streaming dataframe from the streaming job (readStream)
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = true){noformat}
 

So the column `d` is correctly set to `nullable = true` (expected), but in the 
case of the column `e`, it stays non-nullable if it is read using the `read` 
method and it is correctly set to `nullable = true` if read with `readStream`. 
Is that expected? According to this old issue, 
https://issues.apache.org/jira/browse/SPARK-28651, it was supposed to be 
resolved. Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
 

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    .load()
)

print("Streaming dataframe from the streaming job")
df_stream_3.printSchema(){code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Streaming dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = true){noformat}
 

So the column `d` is correctly set to `nullable = true` (expected), but in the 
case of the column `e`, it stays non-nullable if it is read using the `read` 
method and it is correctly set to `nullable = true` if read with `readStream`. 
Is that expected? According to this old issue, 
https://issues.apache.org/jira/browse/SPARK-28651, it was supposed to be 
resolved. Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    .load()
)

print("Streaming dataframe from the streaming job")
df_stream_3.printSchema(){code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Streaming dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = true){noformat}
 

So the column `d` is correctly set to `nullable = true` (expected), but in the 
case of the column `e`, it stays non-nullable if it is read using the `read` 
method and it is correctly set to `nullable = true` if read with `readStream`. 
Is that expected? According to this old PR 
[https://github.com/apache/spark/pull/25382], it was supposed to be resolved. 
Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() 

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    .load()
)

print("Streaming dataframe from the streaming job")
df_stream_3.printSchema(){code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Streaming dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = true){noformat}
 

So the column `d` is correctly set to `nullable = true` (expected), but in the 
case of the column `e`, it stays non-nullable if it is read using the `read` 
method and it is correctly set to `nullable = true` if read with `readStream`. 
Is that expected? According to this old PR 
[https://github.com/apache/spark/pull/25382], it was supposed to be resolved. 
Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = 

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false){noformat}
 

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this old PR 
[https://github.com/apache/spark/pull/25382], it was supposed to be resolved. 
Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false){noformat}
 

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this old PR 

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")

print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false){noformat}
 

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this old PR 
[https://github.com/apache/spark/pull/25382], it was supposed to be resolved. 
Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")
print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false){noformat}
 

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this [old 

[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-48492:
---
Description: 
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")
print("Static dataframe from the streaming job")
df_stream_2.printSchema() {code}
 

 

which outputs:
{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false){noformat}
 

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this old PR 
[https://github.com/apache/spark/pull/25382], it was supposed to be 
resolved. Any ideas?

  was:
Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

```python

from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

 

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")
print("Static dataframe from the streaming job")
df_stream_2.printSchema()

```

which outputs:

```

Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

```

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this [old 
PR](https://github.com/apache/spark/pull/25382), it was supposed to be 
resolved. Any ideas?


> batch-read parquet files 

[jira] [Created] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema

2024-05-31 Thread Julien Peloton (Jira)
Julien Peloton created SPARK-48492:
--

 Summary: batch-read parquet files written by streaming returns 
non-nullable fields in schema
 Key: SPARK-48492
 URL: https://issues.apache.org/jira/browse/SPARK-48492
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.4.1
 Environment: python --version
Python 3.9.13

 

spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/
                        
Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 1.8.0_302
Branch HEAD
Compiled by user centos on 2023-06-19T23:01:01Z
Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6
Reporter: Julien Peloton


Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be 
> nullable for compatibility reasons.

While this seems correct for batch DataFrames, I have a counter example for 
streaming ones:

```python

from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

df = df.withColumn('d', F.lit(1.0))

print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

df_stream = df_stream.withColumn('e', F.lit("error"))

print("Streaming dataframe")
df_stream.printSchema()

 

# Now write the dataframe using writestream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)

spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")
print("Static dataframe from the streaming job")
df_stream_2.printSchema()

```

which outputs:

```

Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

```

So the column `d` is correctly set to `nullable = true`, but not the column 
`e`, which stays non-nullable. Is that expected? According to this [old 
PR](https://github.com/apache/spark/pull/25382) it was supposed to be 
resolved. Any ideas?
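
A possible stopgap, assuming the goal is simply to get nullable fields back after 
the batch read (a sketch against the `test_parquet` output above, not a fix for 
the underlying behavior): rebuild the schema with `nullable=True` and re-apply it.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField

spark = SparkSession.builder.getOrCreate()

# Batch-read the files written by the streaming query above.
df = spark.read.parquet("test_parquet")

# Rebuild the same schema with every field marked nullable and re-apply it.
nullable_schema = StructType(
    [StructField(f.name, f.dataType, nullable=True) for f in df.schema.fields]
)
df_nullable = spark.createDataFrame(df.rdd, nullable_schema)
df_nullable.printSchema()  # all fields now report nullable = true
```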



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48491) Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll

2024-05-31 Thread Rui Wang (Jira)
Rui Wang created SPARK-48491:


 Summary: Refactor HiveWindowFunctionQuerySuite in BeforeAll and 
AfterAll
 Key: SPARK-48491
 URL: https://issues.apache.org/jira/browse/SPARK-48491
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 4.0.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-05-31 Thread Steve Weis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Weis updated SPARK-47172:
---
Due Date: 12/Jun/24  (was: 26/Mar/24)

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 
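
To illustrate the difference the ticket is about, here is a minimal sketch in 
Python using the third-party `cryptography` package (illustration only, 
independent of Spark's Java implementation): GCM appends a 16-byte 
authentication tag and rejects modified ciphertext, which CTR cannot do.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)
plaintext = b"rpc payload"

ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
print(len(ciphertext) - len(plaintext))  # 16: the appended authentication tag

# Flip one bit: decryption fails loudly instead of returning corrupted data,
# which is the property an unauthenticated mode like CTR lacks.
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]
try:
    AESGCM(key).decrypt(nonce, tampered, None)
except InvalidTag:
    print("tampering detected")
```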



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48430) Fix map value extraction when map contains collated strings

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48430.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46758
[https://github.com/apache/spark/pull/46758]

> Fix map value extraction when map contains collated strings
> ---
>
> Key: SPARK-48430
> URL: https://issues.apache.org/jira/browse/SPARK-48430
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Following queries return unexpected results:
> {code:java}
> select collation(map('a', 'b' collate utf8_binary_lcase)['a']);
> select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 
> 'a'));{code}
> Both return UTF8_BINARY instead of UTF8_BINARY_LCASE.
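
A minimal PySpark rendering of the repro (assuming a build with collation 
support, e.g. a 4.0.0 snapshot):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Before the fix both report UTF8_BINARY; with pull request 46758 applied they
# report UTF8_BINARY_LCASE.
spark.sql(
    "select collation(map('a', 'b' collate utf8_binary_lcase)['a'])"
).show(truncate=False)
spark.sql(
    "select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 'a'))"
).show(truncate=False)
```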



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48476) NPE thrown when delimiter set to null in CSV

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48476.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46810
[https://github.com/apache/spark/pull/46810]

> NPE thrown when delimiter set to null in CSV
> 
>
> Key: SPARK-48476
> URL: https://issues.apache.org/jira/browse/SPARK-48476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Assignee: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When customers specified delimiter to null, currently we throw NPE. We should 
> throw customer facing error
> repro:
> spark.read.format("csv")
> .option("delimiter", null)
> .load()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV

2024-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48476:
---

Assignee: Milan Stefanovic

> NPE thrown when delimiter set to null in CSV
> 
>
> Key: SPARK-48476
> URL: https://issues.apache.org/jira/browse/SPARK-48476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Assignee: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
>
> When customers specified delimiter to null, currently we throw NPE. We should 
> throw customer facing error
> repro:
> spark.read.format("csv")
> .option("delimiter", null)
> .load()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48490) Unescapes any literals for message of MessageWithContext

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48490:
---
Labels: pull-request-available  (was: )

> Unescapes any literals for message of MessageWithContext
> 
>
> Key: SPARK-48490
> URL: https://issues.apache.org/jira/browse/SPARK-48490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48484) V2Write use the same TaskAttemptId for different task attempts

2024-05-31 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48484:


Assignee: Jackey Lee

> V2Write use the same TaskAttemptId for different task attempts
> --
>
> Key: SPARK-48484
> URL: https://issues.apache.org/jira/browse/SPARK-48484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Yang Jie
>Assignee: Jackey Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48490) Unescapes any literals for message of MessageWithContext

2024-05-31 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48490:
---

 Summary: Unescapes any literals for message of MessageWithContext
 Key: SPARK-48490
 URL: https://issues.apache.org/jira/browse/SPARK-48490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48484) V2Write use the same TaskAttemptId for different task attempts

2024-05-31 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48484.
--
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46811
[https://github.com/apache/spark/pull/46811]

> V2Write use the same TaskAttemptId for different task attempts
> --
>
> Key: SPARK-48484
> URL: https://issues.apache.org/jira/browse/SPARK-48484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48489) Throw an user-facing error when reading invalid schema from text DataSource

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48489:
---
Labels: pull-request-available  (was: )

> Throw an user-facing error when reading invalid schema from text DataSource
> ---
>
> Key: SPARK-48489
> URL: https://issues.apache.org/jira/browse/SPARK-48489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Stefan Bukorovic
>Priority: Minor
>  Labels: pull-request-available
>
> Text DataSource produces table schema with only 1 column, but it is possible 
> to try and create a table with schema having multiple columns.
> Currently, when user tries this, we have an assert in the code, which fails 
> and throws internal spark error. We should throw a better user-facing error.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48489) Throw an user-facing error when reading invalid schema from text DataSource

2024-05-31 Thread Stefan Bukorovic (Jira)
Stefan Bukorovic created SPARK-48489:


 Summary: Throw an user-facing error when reading invalid schema 
from text DataSource
 Key: SPARK-48489
 URL: https://issues.apache.org/jira/browse/SPARK-48489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.3
Reporter: Stefan Bukorovic


The text DataSource produces a table schema with only one column, but it is 
possible to try to create a table whose schema has multiple columns.

Currently, when a user tries this, an assert in the code fails and throws an 
internal Spark error. We should throw a better user-facing error instead.
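
A sketch of the scenario the description refers to (the table name is made up; 
per the summary, the failure surfaces when reading back the invalid schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The text source only supports a single string column, yet nothing stops a
# user from declaring more than one.
spark.sql("CREATE TABLE multi_col_text (a STRING, b STRING) USING text")

# Per the description, this currently fails an internal assert and raises an
# internal Spark error instead of a user-facing one.
spark.sql("SELECT * FROM multi_col_text").show()
```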

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47690) Hash aggregate support for strings with collation

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47690:
---
Labels: pull-request-available  (was: )

> Hash aggregate support for strings with collation
> -
>
> Key: SPARK-47690
> URL: https://issues.apache.org/jira/browse/SPARK-47690
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48488:
---
Labels: pull-request-available  (was: )

> Restore the original logic of methods `log[info|warning|error]` in 
> `SparkSubmit`
> 
>
> Key: SPARK-48488
> URL: https://issues.apache.org/jira/browse/SPARK-48488
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`

2024-05-31 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48488:
---

 Summary: Restore the original logic of methods 
`log[info|warning|error]` in `SparkSubmit`
 Key: SPARK-48488
 URL: https://issues.apache.org/jira/browse/SPARK-48488
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48487) Update License & Notice according to the dependency changes

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48487:
---
Labels: pull-request-available  (was: )

> Update License & Notice according to the dependency changes
> ---
>
> Key: SPARK-48487
> URL: https://issues.apache.org/jira/browse/SPARK-48487
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48487) Update License & Notice according to the dependency changes

2024-05-31 Thread Kent Yao (Jira)
Kent Yao created SPARK-48487:


 Summary: Update License & Notice according to the dependency 
changes
 Key: SPARK-48487
 URL: https://issues.apache.org/jira/browse/SPARK-48487
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48476:
--

Assignee: (was: Apache Spark)

> NPE thrown when delimiter set to null in CSV
> 
>
> Key: SPARK-48476
> URL: https://issues.apache.org/jira/browse/SPARK-48476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
>
> When customers specified delimiter to null, currently we throw NPE. We should 
> throw customer facing error
> repro:
> spark.read.format("csv")
> .option("delimiter", null)
> .load()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48476:
--

Assignee: Apache Spark

> NPE thrown when delimiter set to null in CSV
> 
>
> Key: SPARK-48476
> URL: https://issues.apache.org/jira/browse/SPARK-48476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> When customers specified delimiter to null, currently we throw NPE. We should 
> throw customer facing error
> repro:
> spark.read.format("csv")
> .option("delimiter", null)
> .load()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47258:
--

Assignee: Apache Spark

> Assign error classes to SHOW CREATE TABLE errors
> 
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47258:
--

Assignee: (was: Apache Spark)

> Assign error classes to SHOW CREATE TABLE errors
> 
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851000#comment-17851000
 ] 

chenfengbin commented on SPARK-48486:
-

The problem appears starting from Spark 3.3 because the method 
getFilterableTableScan in 
org.apache.spark.sql.execution.dynamicpruning.PartitionPruning added support 
for HiveTableRelation. As a result, during the plan.transformUp traversal in 
the prune method, the pruning predicates are amplified while insertPredicate 
recurses inside splitConjunctivePredicates(condition).foreach.

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
> Attachments: create_table.sql
>
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,{*}b.part_dt{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
> [part_dt#223|#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
> directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> When more partitions meet the condition, it will increase by 2 to the power 
> of n, then divided by 2.
> for example:
> select a.part_dt{*},b.part_dt{*},c.part_dt
> from a 
> left join b on b.part_dt = a.part_dt 
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
> 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#257 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
> [part_dt#257|#257]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> The last four lines are all superfluous.
> When more than 10 conditions are met, it will cause a memory overflow when 
> generating DPP, and the generated plan will be about 400 times larger.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Attachment: create_table.sql

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
> Attachments: create_table.sql
>
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,{*}b.part_dt{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
> [part_dt#223|#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
> directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> When more partitions meet the condition, it will increase by 2 to the power 
> of n, then divided by 2.
> for example:
> select a.part_dt{*},b.part_dt{*},c.part_dt
> from a 
> left join b on b.part_dt = a.part_dt 
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
> 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#257 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
> [part_dt#257|#257]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> The last four lines are all superfluous.
> When more than 10 conditions are met, it will cause a memory overflow when 
> generating DPP, and the generated plan will be about 400 times larger.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Attachment: create_table.sql.rtf

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,{*}b.part_dt{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
> [part_dt#223|#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
> directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> When more partitions meet the condition, it will increase by 2 to the power 
> of n, then divided by 2.
> for example:
> select a.part_dt{*},b.part_dt{*},c.part_dt
> from a 
> left join b on b.part_dt = a.part_dt 
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
> 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#257 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
> [part_dt#257|#257]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> The last four lines are all superfluous.
> When more than 10 conditions are met, it will cause a memory overflow when 
> generating DPP, and the generated plan will be about 400 times larger.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Attachment: (was: create_table.sql.rtf)

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,{*}b.part_dt{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
> [part_dt#223|#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
> directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> When more partitions meet the condition, it will increase by 2 to the power 
> of n, then divided by 2.
> for example:
> select a.part_dt{*},b.part_dt{*},c.part_dt
> from a 
> left join b on b.part_dt = a.part_dt 
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
> 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#257 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
> [part_dt#257|#257]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#253 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
> [part_dt#253|#253]
> 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#249 IN (2024-05-30,2024-05-31)
> The last four lines are all superfluous.
> When more than 10 conditions are met, it will cause a memory overflow when 
> generating DPP, and the generated plan will be about 400 times larger.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Description: 
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,{*}b.part_dt{*}
from a left join b on b.part_dt = a.part_dt 
where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt{*},b.part_dt{*},c.part_dt
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#257 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
[part_dt#257|#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)

The last four lines are all superfluous.

When more than 10 conditions are met, it will cause a memory overflow when 
generating DPP, and the generated plan will be about 400 times larger.

  was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt{*},b.part_dt{*},c.part_dt
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#257 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
[part_dt#257|#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)

The last four lines are all superfluous.


[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Description: 
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt{*},b.part_dt{*},c.part_dt
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#257 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 
[part_dt#257|#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 
[part_dt#253|#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)

The last four lines are all superfluous.

When more than 10 conditions are met, it will cause a memory overflow when 
generating DPP, and the generated plan will be about 400 times larger.

  was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt{*},b.part_dt{*},c.part_dt
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#257 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)

The last four lines are all superfluous.



[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Description: 
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt{*},b.part_dt{*},c.part_dt
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#257 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#253 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: 
part_dt#249 IN (2024-05-30,2024-05-31)

The last four lines are all superfluous.

  was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt,a.*,b.part_dt,b.*,c.part_dt,c.*
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

 


> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,a.{*},b.part_dt,b.{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> 

[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenfengbin updated SPARK-48486:

Description: 
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.{*},b.part_dt,b.{*}

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
[part_dt#223|#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt,a.*,b.part_dt,b.*,c.part_dt,c.*
from a 
left join b on b.part_dt = a.part_dt 
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

result:

 

  was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

for example: The partition field of table A or B is part_dt.  

select a.part_dt,a.*,b.part_dt,b.*

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During the generation of Dynamic Partition Pruning (DPP), the tree will be 
traversed recursively more times. 

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions meet the condition, it will increase by 2 to the power of 
n, then divided by 2.

for example:

select a.part_dt,a.*,b.part_dt,b.*
from a left join b on b.part_dt = a.part_dt where a.part_dt in 
('2024-05-30','2024-05-31')

 

 


> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> 
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: chenfengbin
>Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
> Directed Acyclic Graph (DAG) to expand.
> for example: The partition field of table A or B is part_dt.  
> select a.part_dt,a.{*},b.part_dt,b.{*}
> from a left join b on b.part_dt = a.part_dt 
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree will be 
> traversed recursively more times. 
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#223 IN 
> (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 
> [part_dt#223|#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
> part_dt#219 IN (2024-05-30,2024-05-31)
> The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
> directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> When more partitions meet the condition, it will increase by 2 to the power 
> of n, then divided by 2.
> for example:
> select a.part_dt,a.*,b.part_dt,b.*,c.part_dt,c.*
> from a 
> left join b on b.part_dt = a.part_dt 
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.

2024-05-31 Thread chenfengbin (Jira)
chenfengbin created SPARK-48486:
---

 Summary: The Dynamic Partition Pruning (DPP) feature in Spark can 
cause the generated Directed Acyclic Graph (DAG) to expand.
 Key: SPARK-48486
 URL: https://issues.apache.org/jira/browse/SPARK-48486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: chenfengbin


The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated 
Directed Acyclic Graph (DAG) to expand.

For example, the partition field of tables A and B is part_dt.

select a.part_dt,a.*,b.part_dt,b.*

from a left join b on b.part_dt = a.part_dt 

where a.part_dt in ('2024-05-30','2024-05-31')

During Dynamic Partition Pruning (DPP) filter generation, the plan tree is 
traversed recursively more times than necessary.

24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#223 IN 
(2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: 
part_dt#219 IN (2024-05-30,2024-05-31)

The last entry is redundant: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning 
directories with: part_dt#219 IN (2024-05-30,2024-05-31)

When more partitions match the condition, the number of redundant traversals 
grows as 2 to the power of n divided by 2 (i.e. 2^(n-1)).

For example:

select a.part_dt,a.*,b.part_dt,b.*
from a left join b on b.part_dt = a.part_dt where a.part_dt in 
('2024-05-30','2024-05-31')
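
If the extra planning passes become costly, one session-level mitigation (a 
sketch only, not a fix for the underlying duplication) is to turn DPP off for 
the affected query via the standard spark.sql.optimizer.dynamicPartitionPruning.enabled 
switch, trading the pruning benefit for simpler planning:

// Assumes the duplicated "Pruning directories with:" traversals come from DPP
// filter generation; disabling DPP removes the dynamic pruning subqueries.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")

spark.sql(
  """SELECT a.part_dt, b.part_dt
    |FROM a
    |LEFT JOIN b ON b.part_dt = a.part_dt
    |WHERE a.part_dt IN ('2024-05-30', '2024-05-31')""".stripMargin).explain()

// Re-enable afterwards; the default is true.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")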

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48485) Support interruptTag and interruptAll in streaming queries

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48485:
---
Labels: pull-request-available  (was: )

> Support interruptTag and interruptAll in streaming queries
> --
>
> Key: SPARK-48485
> URL: https://issues.apache.org/jira/browse/SPARK-48485
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Spark Connect's interrupt API does not interrupt streaming queries. We should 
> support them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48485) Support interruptTag and interruptAll in streaming queries

2024-05-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48485:


 Summary: Support interruptTag and interruptAll in streaming queries
 Key: SPARK-48485
 URL: https://issues.apache.org/jira/browse/SPARK-48485
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Structured Streaming
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


Spark Connect's interrupt API does not interrupt streaming queries. We should 
support them.
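
For context, a minimal sketch of the tag-based interrupt API in the Spark 
Connect Scala client; interrupting the streaming query below is the behaviour 
this ticket asks for, so treat that part as the proposal rather than what 
currently happens (connection string and tag name are made up):

import org.apache.spark.sql.SparkSession

// Connect to a running Spark Connect server (hypothetical endpoint).
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

spark.addTag("nightly-ingest")          // tag operations started from this point

val query = spark.readStream
  .format("rate")                       // built-in test source, no external dependency
  .load()
  .writeStream
  .format("console")
  .start()

// Today this interrupts tagged batch operations only; with this change it
// should also stop the tagged streaming query started above.
spark.interruptTag("nightly-ingest")
// spark.interruptAll()                 // coarser variant: everything in the session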



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38506) Push partial aggregation through join

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-38506:
---
Labels: pull-request-available  (was: )

> Push partial aggregation through join
> -
>
> Key: SPARK-38506
> URL: https://issues.apache.org/jira/browse/SPARK-38506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> Please see 
> https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization
>  for more details.
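
A rough illustration of the intended rewrite; the fact/store tables and column 
names are invented, and the manual subquery only shows the shape the proposed 
optimizer rule would derive automatically:

// Original shape: aggregate after the join.
spark.sql(
  """SELECT s.store_id, SUM(f.amount) AS total
    |FROM fact f JOIN store s ON f.store_id = s.store_id
    |GROUP BY s.store_id""".stripMargin)

// Hand-written equivalent with the partial aggregation pushed below the join.
spark.sql(
  """SELECT s.store_id, SUM(f.partial_sum) AS total
    |FROM (SELECT store_id, SUM(amount) AS partial_sum
    |      FROM fact
    |      GROUP BY store_id) f
    |JOIN store s ON f.store_id = s.store_id
    |GROUP BY s.store_id""".stripMargin)

The fewer distinct join-key values the fact side has, the more the pre-aggregation 
shrinks the input that reaches the shuffle and join.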



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org