[jira] [Updated] (SPARK-48499) Use Math.abs to get positive numbers
[ https://issues.apache.org/jira/browse/SPARK-48499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48499: --- Labels: pull-request-available (was: ) > Use Math.abs to get positive numbers > > > Key: SPARK-48499 > URL: https://issues.apache.org/jira/browse/SPARK-48499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Junqing Li >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48499) Use Math.abs to get positive numbers
[ https://issues.apache.org/jira/browse/SPARK-48499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junqing Li updated SPARK-48499: --- Summary: Use Math.abs to get positive numbers (was: Use Math.abs to get Unsigned Numbers) > Use Math.abs to get positive numbers > > > Key: SPARK-48499 > URL: https://issues.apache.org/jira/browse/SPARK-48499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Junqing Li >Priority: Major
[jira] [Created] (SPARK-48499) Use Math.abs to get Unsigned Numbers
Junqing Li created SPARK-48499: -- Summary: Use Math.abs to get Unsigned Numbers Key: SPARK-48499 URL: https://issues.apache.org/jira/browse/SPARK-48499 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Junqing Li
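The ticket itself carries no code, but one well-known caveat of "use Math.abs to get positive numbers" on the JVM is that Math.abs(Integer.MIN_VALUE) overflows and stays negative. A small Python sketch (illustrative only, not from the ticket) simulating Java's 32-bit int semantics shows the edge case and a safe alternative:

```python
# Illustrative sketch (not from the ticket): simulate Java's 32-bit Math.abs
# to show the edge case callers must handle when deriving positive numbers:
# Math.abs(Integer.MIN_VALUE) overflows and remains negative.

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def to_int32(x: int) -> int:
    """Wrap an arbitrary Python int into Java's 32-bit two's-complement range."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x > INT_MAX else x

def java_math_abs(x: int) -> int:
    """Model of Java's Math.abs(int): negating Integer.MIN_VALUE overflows."""
    return to_int32(-x) if x < 0 else x

def nonneg_mod(x: int, n: int) -> int:
    """Safe way to land in [0, n): floor-mod, like Java's Math.floorMod."""
    return x % n  # Python's % is already non-negative for positive n

print(java_math_abs(-5))        # 5
print(java_math_abs(INT_MIN))   # -2147483648: still negative!
print(nonneg_mod(INT_MIN, 16))  # 0: safe for bucketing/partitioning
```

This is why code that needs a non-negative bucket or partition number often prefers floor-mod over abs-then-mod.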
[jira] [Updated] (SPARK-48497) Add user guide for batch data source write API
[ https://issues.apache.org/jira/browse/SPARK-48497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48497: --- Labels: pull-request-available (was: ) > Add user guide for batch data source write API > -- > > Key: SPARK-48497 > URL: https://issues.apache.org/jira/browse/SPARK-48497 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add examples for batch data source write.
[jira] [Created] (SPARK-48497) Add user guide for batch data source write API
Allison Wang created SPARK-48497: Summary: Add user guide for batch data source write API Key: SPARK-48497 URL: https://issues.apache.org/jira/browse/SPARK-48497 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add examples for batch data source write.
[jira] [Assigned] (SPARK-48490) Unescapes any literals for message of MessageWithContext
[ https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-48490: -- Assignee: BingKun Pan > Unescapes any literals for message of MessageWithContext > > > Key: SPARK-48490 > URL: https://issues.apache.org/jira/browse/SPARK-48490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available
[jira] [Resolved] (SPARK-48490) Unescapes any literals for message of MessageWithContext
[ https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-48490. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46824 [https://github.com/apache/spark/pull/46824] > Unescapes any literals for message of MessageWithContext > > > Key: SPARK-48490 > URL: https://issues.apache.org/jira/browse/SPARK-48490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0
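The ticket gives no detail on the fix, but as a hypothetical illustration of what "unescaping literals" in a log-message template can mean, here is a minimal Python sketch (the function name and behavior are assumptions, not Spark's implementation):

```python
# Hypothetical illustration (not Spark's MessageWithContext code): decode
# backslash escape sequences such as "\\n" or "\\t" left in a log message
# template into their real characters before the message is rendered.
import codecs

def unescape_literals(template: str) -> str:
    """Turn literal backslash escapes in a template into actual characters."""
    return codecs.decode(template, "unicode_escape")

raw = "task failed\\nreason: {reason}"   # "\n" stored as two literal characters
print(unescape_literals(raw))            # renders across two lines
```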
[jira] [Assigned] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate
[ https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48391: --- Assignee: jiahong.li > use addAll instead of add function in TaskMetrics to accelerate > - > > Key: SPARK-48391 > URL: https://issues.apache.org/jira/browse/SPARK-48391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: jiahong.li >Assignee: jiahong.li >Priority: Major > Labels: pull-request-available > > In the fromAccumulators method of TaskMetrics, we should use `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as _externalAccums is an instance of CopyOnWriteArrayList.
[jira] [Resolved] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate
[ https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48391. - Fix Version/s: 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 46705 [https://github.com/apache/spark/pull/46705] > use addAll instead of add function in TaskMetrics to accelerate > - > > Key: SPARK-48391 > URL: https://issues.apache.org/jira/browse/SPARK-48391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: jiahong.li >Assignee: jiahong.li >Priority: Major > Labels: pull-request-available > Fix For: 3.5.2, 4.0.0 > > In the fromAccumulators method of TaskMetrics, we should use `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as _externalAccums is an instance of CopyOnWriteArrayList.
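The reason addAll helps on a CopyOnWriteArrayList is that every mutation copies the entire backing array, so N single adds copy O(N²) elements while one addAll copies the old array only once. A schematic Python model (not Spark or JDK code; the class here just mimics copy-on-write semantics) makes the cost visible:

```python
# Schematic model (not Spark/JDK code) of why one addAll beats N adds on a
# copy-on-write list: every mutation copies the whole backing array.

class CopyOnWriteList:
    def __init__(self):
        self._data = ()           # immutable snapshot, like the JDK's backing array
        self.elements_copied = 0  # instrumentation: elements copied so far

    def add(self, item):
        self.elements_copied += len(self._data)
        self._data = self._data + (item,)   # full copy on every call

    def add_all(self, items):
        items = tuple(items)
        self.elements_copied += len(self._data)
        self._data = self._data + items     # one copy for the whole batch

one_by_one = CopyOnWriteList()
for i in range(1000):
    one_by_one.add(i)

batched = CopyOnWriteList()
batched.add_all(range(1000))

print(one_by_one.elements_copied)  # 499500: quadratic in the number of adds
print(batched.elements_copied)     # 0: the old (empty) array is copied once
```

Both paths end with the same contents; only the copying work differs, which is the acceleration the ticket is after.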
[jira] [Updated] (SPARK-48466) Wrap empty relation propagation in a dedicated node
[ https://issues.apache.org/jira/browse/SPARK-48466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48466: --- Labels: pull-request-available (was: ) > Wrap empty relation propagation in a dedicated node > --- > > Key: SPARK-48466 > URL: https://issues.apache.org/jira/browse/SPARK-48466 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Ziqi Liu >Priority: Major > Labels: pull-request-available > > Currently we replace the plan with a LocalTableScan in case of empty relation propagation, which loses the information about the original query plan and makes it less human-readable. The idea is to create a dedicated `EmptyRelation` node which is a leaf node but wraps the original query plan inside.
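A minimal Python sketch of the idea (class and method names are hypothetical, not Spark's API): a leaf node that executes as an empty relation but keeps the replaced plan around so explain output stays readable:

```python
# Minimal sketch (names are hypothetical, not Spark's API) of wrapping empty
# relation propagation in a dedicated leaf node that remembers the plan it
# replaced, so explain output stays human-readable.
from dataclasses import dataclass, field

@dataclass
class Plan:
    name: str
    children: list = field(default_factory=list)

    def explain(self, indent=0):
        lines = [" " * indent + self.name]
        for child in self.children:
            lines.extend(child.explain(indent + 2))
        return lines

@dataclass
class EmptyRelation(Plan):
    # Leaf node: nothing to execute, but it wraps the original plan.
    original: Plan = None

    def __post_init__(self):
        self.children = []  # always a leaf

    def explain(self, indent=0):
        return [" " * indent + f"EmptyRelation (was: {self.original.name})"]

join = Plan("Join", [Plan("Scan t1"), Plan("Scan t2")])
replaced = EmptyRelation("EmptyRelation", original=join)
print("\n".join(replaced.explain()))  # prints "EmptyRelation (was: Join)"
```

Compare with today's behavior, where the substituted LocalTableScan says nothing about the Join it replaced.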
[jira] [Updated] (SPARK-48491) Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll
[ https://issues.apache.org/jira/browse/SPARK-48491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48491: --- Labels: pull-request-available (was: ) > Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll > --- > > Key: SPARK-48491 > URL: https://issues.apache.org/jira/browse/SPARK-48491 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-48495) Document planned approach to shredding
David Cashman created SPARK-48495: - Summary: Document planned approach to shredding Key: SPARK-48495 URL: https://issues.apache.org/jira/browse/SPARK-48495 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: David Cashman
[jira] [Resolved] (SPARK-48465) Avoid no-op empty relation propagation in AQE
[ https://issues.apache.org/jira/browse/SPARK-48465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48465. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46814 [https://github.com/apache/spark/pull/46814] > Avoid no-op empty relation propagation in AQE > - > > Key: SPARK-48465 > URL: https://issues.apache.org/jira/browse/SPARK-48465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Ziqi Liu >Assignee: Ziqi Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > We should avoid no-op empty relation propagation in AQE: if we convert an empty QueryStageExec to an empty relation, it will be wrapped into a new query stage and executed -> produce an empty result -> trigger empty relation propagation again. This issue is currently not exposed because AQE will try to reuse the shuffle.
[jira] [Updated] (SPARK-48494) Update airlift:aircompressor to 0.27
[ https://issues.apache.org/jira/browse/SPARK-48494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48494: --- Labels: pull-request-available (was: ) > Update airlift:aircompressor to 0.27 > > > Key: SPARK-48494 > URL: https://issues.apache.org/jira/browse/SPARK-48494 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [CVE-2024-36114|https://www.cve.org/CVERecord?id=CVE-2024-36114]
[jira] [Updated] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance
[ https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-48493: --- Description:

This enhancement adds an option to the Python Datasource Reader to yield Arrow batches directly, significantly boosting performance compared to using tuples or Rows. The implementation leverages the existing work with MapInArrow (see SPARK-46253).

Tests with a custom Python Datasource for High Energy Physics (HEP) data using a ROOT format reader showed an 8x speed increase when using Arrow batches over the traditional method of feeding data via tuples.

Additional context:
* The ROOT data format is widely used in High Energy Physics (HEP), with approximately 1 exabyte of ROOT data currently in existence.
* ROOT data can easily be read with libraries from the Python ecosystem, notably {{uproot}} and {{awkward-array}}, which facilitate reading ROOT data and converting it to Arrow, among other formats.
* A simple ROOT data source can be written with the Python datasource API. While this may not be optimal for performance, it is easy to implement and can leverage the libraries mentioned above.
* For better performance, ingest data via Arrow batches rather than row by row, as the latter is significantly slower (an initial test showed it to be 8 times slower).
* Arrow is very popular now, and this enhancement can benefit communities beyond HEP that use Arrow for efficient data processing.

This enhancement will provide substantial performance improvements and make it easier to work with HEP data and other data types using Apache Spark.

> Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance > Key: SPARK-48493 > URL: https://issues.apache.org/jira/browse/SPARK-48493 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 4.0.0 > Reporter: Luca Canali > Priority: Minor > Labels: pull-request-available
[jira] [Updated] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance
[ https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48493: --- Labels: pull-request-available (was: ) > Enhance Python Datasource Reader with Arrow Batch Support for Improved > Performance > -- > > Key: SPARK-48493 > URL: https://issues.apache.org/jira/browse/SPARK-48493 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Luca Canali >Priority: Minor > Labels: pull-request-available > > > This proposes an enhancement to the Python Datasource Reader by adding an > option to yield Arrow batches directly, significantly boosting performance > compared to using tuples or Rows. This implementation uses the existing work > with MapInArrow (see SPARK-46253 ). > In tests with a custom Python Datasource for High Energy Physics data (ROOT > format reader), using Arrow batches has demonstrated an 8x speed increase > over the traditional method of feeding data via tuples.
[jira] [Created] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance
Luca Canali created SPARK-48493: --- Summary: Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance Key: SPARK-48493 URL: https://issues.apache.org/jira/browse/SPARK-48493 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Luca Canali This proposes an enhancement to the Python Datasource Reader by adding an option to yield Arrow batches directly, significantly boosting performance compared to using tuples or Rows. This implementation uses the existing work with MapInArrow (see SPARK-46253 ). In tests with a custom Python Datasource for High Energy Physics data (ROOT format reader), using Arrow batches has demonstrated an 8x speed increase over the traditional method of feeding data via tuples.
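The gain comes from amortizing per-row handoff and conversion costs across a batch. A schematic pure-Python illustration (not the PySpark DataSource API itself; the generator names are made up) of the row-by-row versus batched reader shapes:

```python
# Schematic illustration (not the PySpark DataSource API): a reader that
# yields one tuple per row pays a per-row handoff/conversion cost, while a
# batched reader pays that cost once per batch of columnar data.

def read_rows(n):
    for i in range(n):
        yield (i, float(i))          # one handoff (and conversion) per row

def read_batches(n, batch_size=1000):
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Columnar layout, like an Arrow RecordBatch: one list per column.
        yield (list(range(start, stop)),
               [float(i) for i in range(start, stop)])

n = 10_000
row_handoffs = sum(1 for _ in read_rows(n))
batch_handoffs = sum(1 for _ in read_batches(n))
print(row_handoffs, batch_handoffs)  # prints "10000 10"
```

With real Arrow batches the saving is larger still, since the data also skips per-value Python object creation, which is consistent with the 8x speedup reported in the ticket.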
[jira] [Updated] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema
[ https://issues.apache.org/jira/browse/SPARK-48492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Peloton updated SPARK-48492: --- Description:

Hello,

In the documentation, it is stated that

> When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

While this seems correct for static DataFrames, I have a counterexample for streaming ones:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

df = spark.createDataFrame(
    [
        Row(a=1, b=2.0, c="toto"),
        Row(a=3, b=4.0, c="titi"),
        Row(a=10, b=11.0, c="tutu"),
    ]
)

# add a non-nullable column
df = df.withColumn('d', F.lit(1.0))
print("Original dataframe")
df.printSchema()

# Write this on disk
df.write.parquet('static.parquet')

# Now load a stream
df_stream = (
    spark.readStream.format("parquet")
    .schema(df.schema)
    .option("path", "static.parquet")
    .option("latestFirst", False)
    .load()
)

# add a non-nullable column
df_stream = df_stream.withColumn('e', F.lit("error"))
print("Streaming dataframe")
df_stream.printSchema()

# Now write the dataframe using writeStream
query = (
    df_stream.writeStream.outputMode("append")
    .format("parquet")
    .option("checkpointLocation", 'test_parquet_checkpoint')
    .option("path", 'test_parquet')
    .trigger(availableNow=True)
    .start()
)
spark.streams.awaitAnyTermination()

# Now read back
df_stream_2 = spark.read.format("parquet").load("test_parquet")
print("Static dataframe from the streaming job (read)")
df_stream_2.printSchema()

# Now load a stream
df_stream_3 = (
    spark.readStream.format("parquet")
    .schema(df_stream_2.schema)
    .option("path", "test_parquet")
    .option("latestFirst", False)
    .load()
)
print("Streaming dataframe from the streaming job (readStream)")
df_stream_3.printSchema()
{code}

which outputs:

{noformat}
Original dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = false)

Streaming dataframe
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Static dataframe from the streaming job (read)
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = false)

Streaming dataframe from the streaming job (readStream)
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)
 |-- e: string (nullable = true)
{noformat}

So the column `d` is correctly set to `nullable = true` (expected), but the column `e` stays non-nullable when read back with the `read` method, while it is correctly set to `nullable = true` when read with `readStream`. Is that expected? According to the old issue https://issues.apache.org/jira/browse/SPARK-28651, this was supposed to be resolved. Any ideas?
true) |-- c: string (nullable = true) |-- d: double (nullable = false) Streaming dataframe root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = true) |-- e: string (nullable = false) Static dataframe from the streaming job root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = true) |-- e: string (nullable = false) Streaming dataframe from the streaming job root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = true) |-- e: string (nullable = true){noformat} So the column `d` is correctly set to `nullable = true` (expected), but in the case of the column `e`, it stays non-nullable if it is read using the `read` method and it is correctly set to `nullable = true` if it is read with `readStream`. Is that expected? According to this old PR [https://github.com/apache/spark/pull/25382] it was supposed to be resolved. Any ideas? was: Hello, In the documentation, it is stated that > When reading Parquet files, all columns are automatically converted to be > nullable for compatibility reasons. 
While this seems correct for static DataFrames, I have a counter example for streaming ones: {code:java} from pyspark.sql import SparkSession from pyspark.sql import Row import pyspark.sql.functions as F spark = SparkSession.builder.getOrCreate() spark.sparkContext.setLogLevel("WARN") df = spark.createDataFrame( [ Row(a=1, b=2.0, c="toto"), Row(a=3, b=4.0, c="titi"), Row(a=10, b=11.0, c="tutu"), ] ) # add a non-nullable column df = df.withColumn('d', F.lit(1.0)) print("Original dataframe") df.printSchema() # Write this on disk df.write.parquet('static.parquet') # Now load a stream df_stream = ( spark.readStream.format("parquet") .schema(df.schema) .option("path", "static.parquet") .option("latestFirst", False) .load() ) # add a non-nullable column df_stream = df_stream.withColumn('e', F.lit("error")) print("Streaming dataframe") df_stream.printSchema() # Now write the dataframe using writestream query = ( df_stream.writeStream.outputMode("append") .format("parquet") .option("checkpointLocation", 'test_parquet_checkpoint') .option("path", 'test_parquet') .trigger(availableNow=True) .start() ) spark.streams.awaitAnyTermination() # Now read back df_stream_2 = spark.read.format("parquet").load("test_parquet") print("Static dataframe from the streaming job") df_stream_2.printSchema() {code} which outputs: {noformat} Original dataframe root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable =
[jira] [Created] (SPARK-48492) batch-read parquet files written by streaming returns non-nullable fields in schema
Julien Peloton created SPARK-48492: -- Summary: batch-read parquet files written by streaming returns non-nullable fields in schema Key: SPARK-48492 URL: https://issues.apache.org/jira/browse/SPARK-48492 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.4.1 Environment: python --version Python 3.9.13 spark-submit --version Spark version 3.4.1 Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 1.8.0_302 Branch HEAD Compiled by user centos on 2023-06-19T23:01:01Z Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6 Reporter: Julien Peloton Hello, In the documentation, it is stated that > When reading Parquet files, all columns are automatically converted to be > nullable for compatibility reasons. While this seems correct for batch DataFrames, I have a counter example for streaming ones: ```python from pyspark.sql import SparkSession from pyspark.sql import Row import pyspark.sql.functions as F spark = SparkSession.builder.getOrCreate() spark.sparkContext.setLogLevel("WARN") df = spark.createDataFrame( [ Row(a=1, b=2.0, c="toto"), Row(a=3, b=4.0, c="titi"), Row(a=10, b=11.0, c="tutu"), ] ) df = df.withColumn('d', F.lit(1.0)) print("Original dataframe") df.printSchema() # Write this on disk df.write.parquet('static.parquet') # Now load a stream df_stream = ( spark.readStream.format("parquet") .schema(df.schema) .option("path", "static.parquet") .option("latestFirst", False) .load() ) df_stream = df_stream.withColumn('e', F.lit("error")) print("Streaming dataframe") df_stream.printSchema() # Now write the dataframe using writestream query = ( df_stream.writeStream.outputMode("append") .format("parquet") .option("checkpointLocation", 'test_parquet_checkpoint') .option("path", 'test_parquet') .trigger(availableNow=True) .start() ) spark.streams.awaitAnyTermination() # Now read back df_stream_2 = spark.read.format("parquet").load("test_parquet") 
print("Static dataframe from the streaming job") df_stream_2.printSchema() ``` which outputs: ``` Original dataframe root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = false) Streaming dataframe root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = true) |-- e: string (nullable = false) Static dataframe from the streaming job root |-- a: long (nullable = true) |-- b: double (nullable = true) |-- c: string (nullable = true) |-- d: double (nullable = true) |-- e: string (nullable = false) ``` So the column `d` is correctly set to `nullable = true`, but not the column `e`, which stays non-nullable. Is that expected? According to this [old PR](https://github.com/apache/spark/pull/25382) it was supposed to be resolved. Any ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
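[Editor's note: Spark schemas round-trip through JSON (`df.schema.json()` / `StructType.fromJson`), so a batch reader affected by this issue can force every field nullable before applying the schema. A minimal pure-Python sketch of that rewrite; the schema literal below merely mirrors the `d` and `e` columns from the report and is illustrative, not captured output:]

```python
import json

# A schema in the JSON shape Spark emits from df.schema.json(); the two
# fields mirror the `d` and `e` columns from the report above.
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "d", "type": "double", "nullable": True, "metadata": {}},
        {"name": "e", "type": "string", "nullable": False, "metadata": {}},
    ],
})

def force_nullable(schema_str: str) -> str:
    """Return the same schema JSON with every top-level field nullable."""
    schema = json.loads(schema_str)
    for field in schema["fields"]:
        field["nullable"] = True
    return json.dumps(schema)

fixed = json.loads(force_nullable(schema_json))
assert all(f["nullable"] for f in fixed["fields"])
```

[The result could then be rebuilt with `StructType.fromJson(json.loads(...))` and passed to `spark.read.schema(...)`, sidestepping the non-nullable `e` column until the underlying behavior is fixed. Nested struct fields would need a recursive walk; this sketch only touches the top level.]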
[jira] [Created] (SPARK-48491) Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll
Rui Wang created SPARK-48491: Summary: Refactor HiveWindowFunctionQuerySuite in BeforeAll and AfterAll Key: SPARK-48491 URL: https://issues.apache.org/jira/browse/SPARK-48491 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47172) Upgrade Transport block cipher mode to GCM
[ https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Weis updated SPARK-47172: --- Due Date: 12/Jun/24 (was: 26/Mar/24) > Upgrade Transport block cipher mode to GCM > -- > > Key: SPARK-47172 > URL: https://issues.apache.org/jira/browse/SPARK-47172 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 3.4.2, 3.5.0 >Reporter: Steve Weis >Priority: Minor > Labels: pull-request-available > > The cipher transformation currently used for encrypting RPC calls is an > unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an > authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being > modified in transit. > The relevant line is here: > [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220] > GCM is relatively more computationally expensive than CTR and adds a 16-byte > block of authentication tag data to each payload. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
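[Editor's note: the motivation is easy to demonstrate, since any pure CTR-style construction is malleable: flipping a bit of ciphertext flips the same bit of the decrypted plaintext. A toy Python illustration, with a SHA-256 keystream standing in for AES-CTR; this is not Spark's code and not a secure cipher, purely a sketch of why an authentication tag matters:]

```python
import hashlib

def toy_ctr(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy CTR-style stream cipher: SHA-256(key || nonce || counter) as the
    # keystream. Encryption and decryption are the same XOR operation.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")
        ).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

key, nonce = b"k" * 16, b"n" * 8
msg = b"transfer $0001 to alice"
ct = bytearray(toy_ctr(key, nonce, msg))

# Without knowing the key, an attacker who can guess plaintext structure
# flips ciphertext bits and the change lands in the decrypted plaintext:
ct[13] ^= ord("1") ^ ord("9")  # byte 13 of msg is the '1' in the amount
assert toy_ctr(key, nonce, bytes(ct)) == b"transfer $0009 to alice"
```

[AES/GCM/NoPadding defeats exactly this: the 16-byte authentication tag mentioned in the ticket makes any such modification fail verification at decryption time instead of silently yielding altered plaintext.]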
[jira] [Resolved] (SPARK-48430) Fix map value extraction when map contains collated strings
[ https://issues.apache.org/jira/browse/SPARK-48430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48430. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46758 [https://github.com/apache/spark/pull/46758] > Fix map value extraction when map contains collated strings > --- > > Key: SPARK-48430 > URL: https://issues.apache.org/jira/browse/SPARK-48430 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Assignee: Nikola Mandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Following queries return unexpected results: > {code:java} > select collation(map('a', 'b' collate utf8_binary_lcase)['a']); > select collation(element_at(map('a', 'b' collate utf8_binary_lcase), > 'a'));{code} > Both return UTF8_BINARY instead of UTF8_BINARY_LCASE. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48476) NPE thrown when delimiter set to null in CSV
[ https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48476. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46810 [https://github.com/apache/spark/pull/46810] > NPE thrown when delimiter set to null in CSV > > > Key: SPARK-48476 > URL: https://issues.apache.org/jira/browse/SPARK-48476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Milan Stefanovic >Assignee: Milan Stefanovic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When customers set the delimiter to null, we currently throw an NPE. We > should throw a customer-facing error instead. > repro: > spark.read.format("csv") > .option("delimiter", null) > .load() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
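[Editor's note: the kind of user-facing check the fix calls for can be sketched in plain Python; the function name and messages below are illustrative, not the actual Spark code:]

```python
def validate_csv_delimiter(delimiter):
    # Reject a null delimiter up front with a clear, user-facing error
    # instead of letting None propagate into the CSV parser as an NPE.
    if delimiter is None:
        raise ValueError("CSV option 'delimiter' must not be null")
    if not isinstance(delimiter, str) or delimiter == "":
        raise ValueError("CSV option 'delimiter' must be a non-empty string")
    return delimiter

validate_csv_delimiter(",")  # a valid delimiter passes through unchanged
```

[Validating options at the API boundary like this turns an internal NullPointerException into an actionable error message for the user.]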
[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV
[ https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48476: --- Assignee: Milan Stefanovic > NPE thrown when delimiter set to null in CSV > > > Key: SPARK-48476 > URL: https://issues.apache.org/jira/browse/SPARK-48476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Milan Stefanovic >Assignee: Milan Stefanovic >Priority: Major > Labels: pull-request-available > > When customers set the delimiter to null, we currently throw an NPE. We > should throw a customer-facing error instead. > repro: > spark.read.format("csv") > .option("delimiter", null) > .load() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48490) Unescapes any literals for message of MessageWithContext
[ https://issues.apache.org/jira/browse/SPARK-48490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48490: --- Labels: pull-request-available (was: ) > Unescapes any literals for message of MessageWithContext > > > Key: SPARK-48490 > URL: https://issues.apache.org/jira/browse/SPARK-48490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48484) V2Write use the same TaskAttemptId for different task attempts
[ https://issues.apache.org/jira/browse/SPARK-48484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-48484: Assignee: Jackey Lee > V2Write use the same TaskAttemptId for different task attempts > -- > > Key: SPARK-48484 > URL: https://issues.apache.org/jira/browse/SPARK-48484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Yang Jie >Assignee: Jackey Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48490) Unescapes any literals for message of MessageWithContext
BingKun Pan created SPARK-48490: --- Summary: Unescapes any literals for message of MessageWithContext Key: SPARK-48490 URL: https://issues.apache.org/jira/browse/SPARK-48490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48484) V2Write use the same TaskAttemptId for different task attempts
[ https://issues.apache.org/jira/browse/SPARK-48484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-48484. -- Fix Version/s: 3.4.4 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 46811 [https://github.com/apache/spark/pull/46811] > V2Write use the same TaskAttemptId for different task attempts > -- > > Key: SPARK-48484 > URL: https://issues.apache.org/jira/browse/SPARK-48484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 3.4.4, 3.5.2, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
[ https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48489: --- Labels: pull-request-available (was: ) > Throw a user-facing error when reading invalid schema from text DataSource > --- > > Key: SPARK-48489 > URL: https://issues.apache.org/jira/browse/SPARK-48489 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Stefan Bukorovic >Priority: Minor > Labels: pull-request-available > > The text DataSource produces a table schema with only one column, but it is > possible to try to create a table with a schema having multiple columns. > Currently, when a user tries this, an assert in the code fails and throws an > internal Spark error. We should throw a better user-facing error. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
Stefan Bukorovic created SPARK-48489: Summary: Throw a user-facing error when reading invalid schema from text DataSource Key: SPARK-48489 URL: https://issues.apache.org/jira/browse/SPARK-48489 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.3 Reporter: Stefan Bukorovic The text DataSource produces a table schema with only one column, but it is possible to try to create a table with a schema having multiple columns. Currently, when a user tries this, an assert in the code fails and throws an internal Spark error. We should throw a better user-facing error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
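[Editor's note: the intended check can be sketched in plain Python; the function name and message are illustrative, since the text source reads each input line into a single string column:]

```python
def validate_text_schema(field_names):
    # The text data source yields exactly one string column per line, so a
    # schema with any other column count should fail with a clear,
    # user-facing message rather than tripping an internal assert.
    if len(field_names) != 1:
        raise ValueError(
            "Text data source supports only a single column, but "
            f"{len(field_names)} columns were provided: {list(field_names)}"
        )
    return field_names

validate_text_schema(["value"])  # the single text column is accepted
```

[Surfacing the constraint at schema-resolution time gives the user an actionable message instead of an internal assertion failure.]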
[jira] [Updated] (SPARK-47690) Hash aggregate support for strings with collation
[ https://issues.apache.org/jira/browse/SPARK-47690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47690: --- Labels: pull-request-available (was: ) > Hash aggregate support for strings with collation > - > > Key: SPARK-47690 > URL: https://issues.apache.org/jira/browse/SPARK-47690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48488: --- Labels: pull-request-available (was: ) > Restore the original logic of methods `log[info|warning|error]` in > `SparkSubmit` > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
BingKun Pan created SPARK-48488: --- Summary: Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit` Key: SPARK-48488 URL: https://issues.apache.org/jira/browse/SPARK-48488 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48487) Update License & Notice according to the dependency changes
[ https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48487: --- Labels: pull-request-available (was: ) > Update License & Notice according to the dependency changes > --- > > Key: SPARK-48487 > URL: https://issues.apache.org/jira/browse/SPARK-48487 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48487) Update License & Notice according to the dependency changes
Kent Yao created SPARK-48487: Summary: Update License & Notice according to the dependency changes Key: SPARK-48487 URL: https://issues.apache.org/jira/browse/SPARK-48487 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV
[ https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48476: -- Assignee: (was: Apache Spark) > NPE thrown when delimiter set to null in CSV > > > Key: SPARK-48476 > URL: https://issues.apache.org/jira/browse/SPARK-48476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Milan Stefanovic >Priority: Major > Labels: pull-request-available > > When a customer sets the delimiter to null, we currently throw an NPE. We should > throw a customer-facing error instead. > repro: > spark.read.format("csv") > .option("delimiter", null) > .load() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48476) NPE thrown when delimiter set to null in CSV
[ https://issues.apache.org/jira/browse/SPARK-48476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48476: -- Assignee: Apache Spark > NPE thrown when delimiter set to null in CSV > > > Key: SPARK-48476 > URL: https://issues.apache.org/jira/browse/SPARK-48476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Milan Stefanovic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > When a customer sets the delimiter to null, we currently throw an NPE. We should > throw a customer-facing error instead. > repro: > spark.read.format("csv") > .option("delimiter", null) > .load() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
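The shape of the fix is eager option validation. A minimal, hypothetical Python sketch (the `[INVALID_DELIMITER]` prefix and function name are invented for illustration, not Spark's real API): check the `delimiter` option up front and raise a clear error, instead of letting a null value reach the parser and surface as a NullPointerException.

```python
# Hypothetical sketch of the proposed fix: validate the CSV `delimiter`
# option eagerly with a customer-facing error instead of an NPE.
def validate_delimiter(delimiter):
    if delimiter is None:
        raise ValueError(
            "[INVALID_DELIMITER] The 'delimiter' option must not be null.")
    if delimiter == "":
        raise ValueError(
            "[INVALID_DELIMITER] The 'delimiter' option must not be empty.")
    return delimiter

validate_delimiter(",")        # a valid delimiter passes through
try:
    validate_delimiter(None)   # the repro case from the ticket
except ValueError as e:
    print(e)
```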
[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors
[ https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47258: -- Assignee: Apache Spark > Assign error classes to SHOW CREATE TABLE errors > > > Key: SPARK-47258 > URL: https://issues.apache.org/jira/browse/SPARK-47258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* > defined in {*}core/src/main/resources/error/error-classes.json{*}. The name > should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't > already exist. Check exception fields by using {*}checkError(){*}. That function > checks only the valuable error fields and avoids depending on the error text > message. This way, tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error; see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear, and propose a solution for how users can avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors
[ https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47258: -- Assignee: (was: Apache Spark) > Assign error classes to SHOW CREATE TABLE errors > > > Key: SPARK-47258 > URL: https://issues.apache.org/jira/browse/SPARK-47258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* > defined in {*}core/src/main/resources/error/error-classes.json{*}. The name > should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't > already exist. Check exception fields by using {*}checkError(){*}. That function > checks only the valuable error fields and avoids depending on the error text > message. This way, tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error; see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear, and propose a solution for how users can avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
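The checkError() idea from the SPARK-47258 task description can be illustrated with a minimal Python analogue. All names here (`SparkThrowable`, `check_error`, the error-class string) are illustrative stand-ins, not Spark's real API; the point is that tests assert on stable structured fields (error class and message parameters) rather than on rendered message text, so wording in error-classes.json can change without breaking tests.

```python
# Minimal analogue of Spark's checkError() pattern, with made-up names.
class SparkThrowable(Exception):
    def __init__(self, error_class, parameters):
        self.error_class = error_class    # stable identifier
        self.parameters = parameters      # stable message parameters
        super().__init__(error_class)

def check_error(exc, error_class, parameters):
    # Compare only the structured fields, never the formatted message.
    assert exc.error_class == error_class, exc.error_class
    assert exc.parameters == parameters, exc.parameters

# A test written this way survives rewording of the error template.
err = SparkThrowable("UNSUPPORTED_SHOW_CREATE_TABLE",
                     {"tableName": "my_table"})
check_error(err, "UNSUPPORTED_SHOW_CREATE_TABLE",
            {"tableName": "my_table"})
print("check_error passed")
```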
[jira] [Commented] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851000#comment-17851000 ] chenfengbin commented on SPARK-48486: - The problem appears starting from Spark 3.3 because the method getFilterableTableScan in org.apache.spark.sql.execution.dynamicpruning.PartitionPruning added support for HiveTableRelation. Then, during the plan.transformUp traversal in the prune method, the amplification occurs in the recursion of insertPredicate inside splitConjunctivePredicates(condition).foreach. > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > > > Key: SPARK-48486 > URL: https://issues.apache.org/jira/browse/SPARK-48486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chenfengbin >Priority: Major > Attachments: create_table.sql > > > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > for example: The partition field of table A or B is part_dt. > select a.part_dt,{*}b.part_dt{*} > from a left join b on b.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > During the generation of Dynamic Partition Pruning (DPP), the tree will be > traversed recursively more times. > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#223 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 > [part_dt#223|#223] > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning > directories with: part_dt#219 IN (2024-05-30,2024-05-31) > When more partitions meet the condition, it will increase by 2 to the power > of n, then divided by 2. 
> for example: > select a.part_dt{*},b.part_dt{*},c.part_dt > from a > left join b on b.part_dt = a.part_dt > left join c on c.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > result: > 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#257 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 > [part_dt#257|#257] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > The last four lines are all superfluous. > When more than 10 conditions are met, it will cause a memory overflow when > generating DPP, and the generated plan will be about 400 times larger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
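The report's growth claim ("2 to the power of n, then divided by 2") can be tabulated with a one-liner. Note the report does not pin down exactly what n counts (joins, or partitions meeting the condition), so this only illustrates the shape of the claimed 2**n / 2 blow-up in redundant pruning passes:

```python
# Tabulate the reporter's claimed growth of redundant DPP pruning
# passes: 2 to the power of n, divided by 2. The mapping from n to
# query shape is the reporter's estimate, not a verified formula.
def redundant_passes(n):
    return 2 ** n // 2

for n in range(1, 11):
    print(n, redundant_passes(n))
```

Whatever n measures, exponential growth of duplicated pruning predicates is consistent with the reported memory overflow and the roughly 400x larger plan once more than ten conditions match.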
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenfengbin updated SPARK-48486: Attachment: create_table.sql > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > > > Key: SPARK-48486 > URL: https://issues.apache.org/jira/browse/SPARK-48486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chenfengbin >Priority: Major > Attachments: create_table.sql > > > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > for example: The partition field of table A or B is part_dt. > select a.part_dt,{*}b.part_dt{*} > from a left join b on b.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > During the generation of Dynamic Partition Pruning (DPP), the tree will be > traversed recursively more times. > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#223 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 > [part_dt#223|#223] > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning > directories with: part_dt#219 IN (2024-05-30,2024-05-31) > When more partitions meet the condition, it will increase by 2 to the power > of n, then divided by 2. 
> for example: > select a.part_dt{*},b.part_dt{*},c.part_dt > from a > left join b on b.part_dt = a.part_dt > left join c on c.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > result: > 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#257 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 > [part_dt#257|#257] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > The last four lines are all superfluous. > When more than 10 conditions are met, it will cause a memory overflow when > generating DPP, and the generated plan will be about 400 times larger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenfengbin updated SPARK-48486: Attachment: create_table.sql.rtf > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > > > Key: SPARK-48486 > URL: https://issues.apache.org/jira/browse/SPARK-48486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chenfengbin >Priority: Major > > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > for example: The partition field of table A or B is part_dt. > select a.part_dt,{*}b.part_dt{*} > from a left join b on b.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > During the generation of Dynamic Partition Pruning (DPP), the tree will be > traversed recursively more times. > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#223 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 > [part_dt#223|#223] > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning > directories with: part_dt#219 IN (2024-05-30,2024-05-31) > When more partitions meet the condition, it will increase by 2 to the power > of n, then divided by 2. 
> for example: > select a.part_dt{*},b.part_dt{*},c.part_dt > from a > left join b on b.part_dt = a.part_dt > left join c on c.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > result: > 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#257 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 > [part_dt#257|#257] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > The last four lines are all superfluous. > When more than 10 conditions are met, it will cause a memory overflow when > generating DPP, and the generated plan will be about 400 times larger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenfengbin updated SPARK-48486: Attachment: (was: create_table.sql.rtf) > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > > > Key: SPARK-48486 > URL: https://issues.apache.org/jira/browse/SPARK-48486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chenfengbin >Priority: Major > > The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated > Directed Acyclic Graph (DAG) to expand. > for example: The partition field of table A or B is part_dt. > select a.part_dt,{*}b.part_dt{*} > from a left join b on b.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > During the generation of Dynamic Partition Pruning (DPP), the tree will be > traversed recursively more times. > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#223 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 > [part_dt#223|#223] > 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: > part_dt#219 IN (2024-05-30,2024-05-31) > The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning > directories with: part_dt#219 IN (2024-05-30,2024-05-31) > When more partitions meet the condition, it will increase by 2 to the power > of n, then divided by 2. 
> for example: > select a.part_dt{*},b.part_dt{*},c.part_dt > from a > left join b on b.part_dt = a.part_dt > left join c on c.part_dt = a.part_dt > where a.part_dt in ('2024-05-30','2024-05-31') > result: > 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#257 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 > [part_dt#257|#257] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#253 IN > (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 > [part_dt#253|#253] > 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: > part_dt#249 IN (2024-05-30,2024-05-31) > The last four lines are all superfluous. > When more than 10 conditions are met, it will cause a memory overflow when > generating DPP, and the generated plan will be about 400 times larger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenfengbin updated SPARK-48486: Description: The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand. for example: The partition field of table A or B is part_dt. select a.part_dt,{*}b.part_dt{*} from a left join b on b.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') During the generation of Dynamic Partition Pruning (DPP), the tree will be traversed recursively more times. 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223|#223] 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) When more partitions meet the condition, it will increase by 2 to the power of n, then divided by 2. 
for example: select a.part_dt{*},b.part_dt{*},c.part_dt from a left join b on b.part_dt = a.part_dt left join c on c.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') result: 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#257 IN (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257|#257] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) The last four lines are all superfluous. When more than 10 conditions are met, it will cause a memory overflow when generating DPP, and the generated plan will be about 400 times larger. was: The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand. for example: The partition field of table A or B is part_dt. select a.part_dt,a.{*},b.part_dt,b.{*} from a left join b on b.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') During the generation of Dynamic Partition Pruning (DPP), the tree will be traversed recursively more times. 
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223|#223] 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) When more partitions meet the condition, it will increase by 2 to the power of n, then divided by 2. for example: select a.part_dt{*},b.part_dt{*},c.part_dt from a left join b on b.part_dt = a.part_dt left join c on c.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') result: 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#257 IN (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257|#257] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) The last four lines are all superfluous.
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenfengbin updated SPARK-48486: Description: The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand. for example: The partition field of table A or B is part_dt. select a.part_dt,a.{*},b.part_dt,b.{*} from a left join b on b.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') During the generation of Dynamic Partition Pruning (DPP), the tree will be traversed recursively more times. 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223|#223] 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) When more partitions meet the condition, it will increase by 2 to the power of n, then divided by 2. 
for example: select a.part_dt{*},b.part_dt{*},c.part_dt from a left join b on b.part_dt = a.part_dt left join c on c.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') result: 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#257 IN (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257|#257] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253|#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) The last four lines are all superfluous. When more than 10 conditions are met, it will cause a memory overflow when generating DPP, and the generated plan will be about 400 times larger. was: The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand. for example: The partition field of table A or B is part_dt. select a.part_dt,a.{*},b.part_dt,b.{*} from a left join b on b.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') During the generation of Dynamic Partition Pruning (DPP), the tree will be traversed recursively more times. 
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223|#223] 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) The last one is extra:24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31) When more partitions meet the condition, it will increase by 2 to the power of n, then divided by 2. for example: select a.part_dt{*},b.part_dt{*},c.part_dt from a left join b on b.part_dt = a.part_dt left join c on c.part_dt = a.part_dt where a.part_dt in ('2024-05-30','2024-05-31') result: 24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#257 IN (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253] 24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31) The last four lines are all superfluous. > The
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenfengbin updated SPARK-48486:
Description:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
for example: the partition field of tables A and B is part_dt.
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
for example:
select a.part_dt, a.*, b.part_dt, b.*, c.part_dt, c.*
from a
left join b on b.part_dt = a.part_dt
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
result:
24/05/31 16:29:19 INFO ExecuteStatement: Execute in full collect mode
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#257 IN (2024-05-30,2024-05-31),isnotnull(part_dt#257),dynamicpruning#262 [part_dt#257]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31)
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#253 IN (2024-05-30,2024-05-31),isnotnull(part_dt#253),dynamicpruning#261 [part_dt#253]
24/05/31 16:29:19 INFO DataSourceStrategy: Pruning directories with: part_dt#249 IN (2024-05-30,2024-05-31)
The last four lines are all superfluous.

was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
for example: the partition field of tables A and B is part_dt.
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
for example:
select a.part_dt, a.*, b.part_dt, b.*, c.part_dt, c.*
from a
left join b on b.part_dt = a.part_dt
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
result:

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: chenfengbin
> Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
> for example: the partition field of tables A and B is part_dt.
> select a.part_dt, a.*, b.part_dt, b.*
> from a left join b on b.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN
[jira] [Updated] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
[ https://issues.apache.org/jira/browse/SPARK-48486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenfengbin updated SPARK-48486:
Description:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
for example: the partition field of tables A and B is part_dt.
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
for example:
select a.part_dt, a.*, b.part_dt, b.*, c.part_dt, c.*
from a
left join b on b.part_dt = a.part_dt
left join c on c.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
result:

was:
The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
for example: the partition field of tables A and B is part_dt.
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
for example:
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')

> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
>
> Key: SPARK-48486
> URL: https://issues.apache.org/jira/browse/SPARK-48486
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: chenfengbin
> Priority: Major
>
> The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
> for example: the partition field of tables A and B is part_dt.
> select a.part_dt, a.*, b.part_dt, b.*
> from a left join b on b.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
> 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
> As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
> for example:
> select a.part_dt, a.*, b.part_dt, b.*, c.part_dt, c.*
> from a
> left join b on b.part_dt = a.part_dt
> left join c on c.part_dt = a.part_dt
> where a.part_dt in ('2024-05-30','2024-05-31')
> result:
-- This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48486) The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
chenfengbin created SPARK-48486:
---
Summary: The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
Key: SPARK-48486
URL: https://issues.apache.org/jira/browse/SPARK-48486
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.3.0
Reporter: chenfengbin

The Dynamic Partition Pruning (DPP) feature in Spark can cause the generated Directed Acyclic Graph (DAG) to expand.
for example: the partition field of tables A and B is part_dt.
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
During the generation of Dynamic Partition Pruning (DPP), the tree is traversed recursively more times than necessary.
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#223 IN (2024-05-30,2024-05-31),isnotnull(part_dt#223),dynamicpruning#234 [part_dt#223]
24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
The last line is extra: 24/05/31 16:14:14 INFO DataSourceStrategy: Pruning directories with: part_dt#219 IN (2024-05-30,2024-05-31)
As more joined tables meet the condition, the number of traversals grows as 2^n / 2.
for example:
select a.part_dt, a.*, b.part_dt, b.*
from a
left join b on b.part_dt = a.part_dt
where a.part_dt in ('2024-05-30','2024-05-31')
-- This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
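The SPARK-48486 reporter's growth estimate ("2 to the power of n, then divided by 2") can be sketched as a tiny model. This is only the arithmetic of the claim as stated in the ticket, not Spark's actual planner code; the function name is hypothetical:

```python
# Hypothetical model of the growth claimed in SPARK-48486: with n partitioned
# tables joined together, the number of pruning traversals during DPP
# generation is said to grow as 2**n / 2 (this mirrors the report's claim,
# not Spark's implementation).

def dpp_traversals(n_tables: int) -> int:
    """Claimed number of pruning traversals for a query joining n_tables tables."""
    return 2 ** n_tables // 2

# For the three-join example this gives 4, matching the four superfluous
# "Pruning directories with:" lines in the quoted log.
print(dpp_traversals(3))  # → 4
```

At n = 10 the model already yields 512 traversals, which is consistent in spirit with the reporter's observation that plans blow up (roughly 400x) once more than 10 such conditions are involved.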
[jira] [Updated] (SPARK-48485) Support interruptTag and interruptAll in streaming queries
[ https://issues.apache.org/jira/browse/SPARK-48485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48485: --- Labels: pull-request-available (was: ) > Support interruptTag and interruptAll in streaming queries > -- > > Key: SPARK-48485 > URL: https://issues.apache.org/jira/browse/SPARK-48485 > Project: Spark > Issue Type: Improvement > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > Spark Connect's interrupt API does not interrupt streaming queries. We should > support them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48485) Support interruptTag and interruptAll in streaming queries
Hyukjin Kwon created SPARK-48485: Summary: Support interruptTag and interruptAll in streaming queries Key: SPARK-48485 URL: https://issues.apache.org/jira/browse/SPARK-48485 Project: Spark Issue Type: Improvement Components: Connect, Structured Streaming Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Spark Connect's interrupt API does not interrupt streaming queries. We should support them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-38506: --- Labels: pull-request-available (was: ) > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > > Please see > https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
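The Teradata page linked from SPARK-38506 describes partial GROUP BY pushdown. As a rough, non-Spark illustration of why a partial SUM aggregate can be pushed below a join, here is plain Python on toy data; the equivalence shown assumes a simple equi-join with SUM, and all table and variable names are invented for the example:

```python
from collections import defaultdict

# Toy demonstration of partial-aggregation-through-join: computing
# SUM(t1.v) GROUP BY t1.k over "t1 JOIN t2 ON k" either by joining first,
# or by pre-aggregating t1 per key and scaling by the match count in t2.
# The second plan feeds a much smaller input into the join.

t1 = [("a", 1), ("a", 2), ("b", 5)]          # (k, v)
t2 = [("a", "x"), ("a", "y"), ("b", "z")]    # (k, payload)

# Plan 1: join first, then aggregate.
joined = [(k1, v) for (k1, v) in t1 for (k2, _) in t2 if k1 == k2]
agg_after_join = defaultdict(int)
for k, v in joined:
    agg_after_join[k] += v

# Plan 2: partially aggregate t1 below the join, then multiply each
# partial sum by the number of matching rows on the other side.
partial = defaultdict(int)
for k, v in t1:
    partial[k] += v
matches = defaultdict(int)
for k, _ in t2:
    matches[k] += 1
agg_pushed = {k: s * matches[k] for k, s in partial.items() if matches[k]}

assert dict(agg_after_join) == agg_pushed  # both give {'a': 6, 'b': 5}
```

The real optimizer rule has to prove this rewrite safe per aggregate function and join type; the sketch only conveys why the transformation can preserve results while shrinking the join input.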