[ 
https://issues.apache.org/jira/browse/SPARK-52401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Cardoso updated SPARK-52401:
------------------------------------
    Description: 
I encountered an inconsistency when using {{.collect()}} on a DataFrame that 
references an external Spark table. Specifically, after appending new data to 
the table, {{.count()}} returns the expected number of rows, but {{.collect()}} 
does not reflect the update and returns outdated results.

*Steps to Reproduce:*
The following snippet demonstrates the issue:
{code:python}
import pyspark
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
    ]
)

table_name = "my_table"
spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
    name=table_name, mode="append", path="<choose a path>"
)

df = spark.table(table_name)

assert df.count() == 0
assert df.collect() == []

spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
    table_name
)
assert df.count() == 1
assert len(df.collect()) == 1  # This fails
{code}
*Expected Behavior:*
After appending data to the table, both {{.count()}} and {{.collect()}} should 
reflect the updated contents.

*Observed Behavior:*
{{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an
empty list.

*Question:*
Is Spark caching something? Why does {{.count()}} reflect the update while 
{{.collect()}} does not?
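
A plausible explanation is that the resolved plan behind {{df}} holds a cached file listing for the file-based table, so the scan reuses stale metadata. The sketch below shows a workaround using {{spark.catalog.refreshTable()}}, which is the standard PySpark call for invalidating cached metadata for a table; the table name and temporary path here are placeholders for this sketch, and whether the refresh resolves this exact behavior on 3.1.2 is an assumption:
{code:python}
import shutil
import tempfile

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = SparkSession.builder.appName("RefreshDemo").getOrCreate()

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
    ]
)

table_name = "my_table_refresh_demo"  # placeholder name for this sketch
path = tempfile.mkdtemp()  # stand-in for "<choose a path>"
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# Create the external table empty, then append one row, as in the report.
spark.createDataFrame([], schema).write.saveAsTable(
    name=table_name, mode="append", path=path
)
assert spark.table(table_name).count() == 0

spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
    table_name
)

# Invalidate cached metadata (including any cached file listing) for the
# table, then re-read it so the scan sees the newly appended files.
spark.catalog.refreshTable(table_name)
fresh_rows = spark.table(table_name).collect()

spark.sql(f"DROP TABLE IF EXISTS {table_name}")
shutil.rmtree(path, ignore_errors=True)
{code}
After the refresh, {{fresh_rows}} contains the appended row, so both {{.count()}} and {{.collect()}} agree.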

  was:
I encountered an inconsistency when using {{.collect()}} on a DataFrame that 
references an external Spark table. Specifically, after appending new data to 
the table, {{.count()}} returns the expected number of rows, but {{.collect()}} 
does not reflect the update and returns outdated results.

*Steps to Reproduce:*
The following snippet demonstrates the issue:
{code:python}
import pyspark
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
    ]
)

table_name = "my_table"
spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
    name=table_name, mode="append", path="<choose a path>"
)

df = spark.table(table_name)

assert df.count() == 0
assert df.collect() == []

spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
    table_name
)
assert df.count() == 1
assert len(df.collect()) == 1  # This fails
{code}
*Expected Behavior:*
After appending data to the table, both {{.count()}} and {{.collect()}} should 
reflect the updated contents.

*Observed Behavior:*
{{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an
empty list.

*Question:*
Is Spark caching the execution plan for {{df}}? Why does {{.count()}}
reflect the update while {{.collect()}} does not?


> Unexpected behavior when using .collect() after modifying an external Spark 
> table
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-52401
>                 URL: https://issues.apache.org/jira/browse/SPARK-52401
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.1.2
>            Reporter: Rodrigo Cardoso
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
