[
https://issues.apache.org/jira/browse/SPARK-52401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-52401:
-----------------------------------
Labels: pull-request-available (was: )
> Unexpected behavior when using .collect() after modifying an external Spark
> table
> ---------------------------------------------------------------------------------
>
> Key: SPARK-52401
> URL: https://issues.apache.org/jira/browse/SPARK-52401
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.1.2
> Reporter: Rodrigo Cardoso
> Priority: Critical
> Labels: pull-request-available
>
> I encountered an inconsistency when using {{.collect()}} on a DataFrame that
> references an external Spark table. Specifically, after appending new data to
> the table, {{.count()}} returns the expected number of rows, but
> {{.collect()}} does not reflect the update and returns outdated results.
> *Steps to Reproduce:*
> The following snippet demonstrates the issue:
> {code:python}
> import pyspark
> from pyspark.sql.types import StructField, StructType, IntegerType, StringType
>
> spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()
> schema = StructType(
>     [
>         StructField("col1", IntegerType(), True),
>         StructField("col2", StringType(), True),
>     ]
> )
> table_name = "my_table"
>
> # Create an empty external table at an explicit path.
> spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
>     name=table_name, mode="append", path="<choose a path>"
> )
> df = spark.table(table_name)
> assert df.count() == 0
> assert df.collect() == []
>
> # Append one row, then reuse the DataFrame created before the append.
> spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
>     table_name
> )
> assert df.count() == 1
> assert len(df.collect()) == 1  # This fails
> {code}
> *Expected Behavior:*
> After appending data to the table, both {{.count()}} and {{.collect()}}
> should reflect the updated contents.
> *Observed Behavior:*
> {{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an
> empty list.
> *Question:*
> Is Spark caching something? Why does {{.count()}} reflect the update while
> {{.collect()}} does not?
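
One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: refresh the table and collect from a newly created DataFrame instead of the one built before the append. The helper name `collect_fresh` is hypothetical and not part of the report; `spark.catalog.refreshTable` and `spark.table` are existing PySpark calls. Whether this returns the appended row on 3.1.2 has not been verified here.

```python
def collect_fresh(spark, table_name):
    """Refresh the table, then collect from a newly created DataFrame.

    `spark` is an active SparkSession. `collect_fresh` is a hypothetical
    helper used only for illustration.
    """
    # Invalidate and refresh cached data and metadata Spark holds for the table.
    spark.catalog.refreshTable(table_name)
    # Build a new DataFrame rather than reusing one created before the append.
    return spark.table(table_name).collect()
```

If `collect_fresh(spark, table_name)` returns the appended row while `df.collect()` on the older `df` still returns an empty list, that would suggest the stale result comes from state held by the existing DataFrame rather than from the table itself.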