[ https://issues.apache.org/jira/browse/SPARK-52401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rodrigo Cardoso updated SPARK-52401:
------------------------------------
    Description:
I encountered an inconsistency when using {{.collect()}} on a DataFrame that references an external Spark table. Specifically, after appending new data to the table, {{.count()}} returns the expected number of rows, but {{.collect()}} does not reflect the update and returns outdated results.

*Steps to Reproduce:*

The following snippet demonstrates the issue:
{code:python}
import pyspark
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
    ]
)
table_name = "my_table"

# Create the (empty) external table at an explicit path.
spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
    name=table_name, mode="append", path="<choose a path>"
)

# DataFrame created before the append.
df = spark.table(table_name)
assert df.count() == 0
assert df.collect() == []

# Append one row to the same table.
spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
    table_name
)
assert df.count() == 1
assert len(df.collect()) == 1  # This fails
{code}
*Expected Behavior:*

After appending data to the table, both {{.count()}} and {{.collect()}} should reflect the updated contents.

*Observed Behavior:*

{{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an empty list.

*Question:*

Is Spark caching something? Why does {{.count()}} reflect the update while {{.collect()}} does not?

  was:
I encountered an inconsistency when using {{.collect()}} on a DataFrame that references an external Spark table. Specifically, after appending new data to the table, {{.count()}} returns the expected number of rows, but {{.collect()}} does not reflect the update and returns outdated results.

*Steps to Reproduce:*

The following snippet demonstrates the issue:
{code:python}
import pyspark
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
    ]
)
table_name = "my_table"

# Create the (empty) external table at an explicit path.
spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
    name=table_name, mode="append", path="<choose a path>"
)

# DataFrame created before the append.
df = spark.table(table_name)
assert df.count() == 0
assert df.collect() == []

# Append one row to the same table.
spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
    table_name
)
assert df.count() == 1
assert len(df.collect()) == 1  # This fails
{code}
*Expected Behavior:*

After appending data to the table, both {{.count()}} and {{.collect()}} should reflect the updated contents.

*Observed Behavior:*

{{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an empty list.

*Question:*

Is Spark caching the execution plan for {{df}}? Why does {{.count()}} reflect the update while {{.collect()}} does not?


> Unexpected behavior when using .collect() after modifying an external Spark
> table
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-52401
>                 URL: https://issues.apache.org/jira/browse/SPARK-52401
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.1.2
>            Reporter: Rodrigo Cardoso
>            Priority: Critical
>
> I encountered an inconsistency when using {{.collect()}} on a DataFrame that
> references an external Spark table. Specifically, after appending new data to
> the table, {{.count()}} returns the expected number of rows, but
> {{.collect()}} does not reflect the update and returns outdated results.
> *Steps to Reproduce:*
> The following snippet demonstrates the issue:
> {code:python}
> import pyspark
> from pyspark.sql.types import StructField, StructType, IntegerType, StringType
>
> spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()
>
> schema = StructType(
>     [
>         StructField("col1", IntegerType(), True),
>         StructField("col2", StringType(), True),
>     ]
> )
> table_name = "my_table"
>
> # Create the (empty) external table at an explicit path.
> spark.createDataFrame(data=[], schema=schema).write.saveAsTable(
>     name=table_name, mode="append", path="<choose a path>"
> )
>
> # DataFrame created before the append.
> df = spark.table(table_name)
> assert df.count() == 0
> assert df.collect() == []
>
> # Append one row to the same table.
> spark.createDataFrame([(1, "foo")], schema).write.mode("append").saveAsTable(
>     table_name
> )
> assert df.count() == 1
> assert len(df.collect()) == 1  # This fails
> {code}
> *Expected Behavior:*
> After appending data to the table, both {{.count()}} and {{.collect()}}
> should reflect the updated contents.
> *Observed Behavior:*
> {{.count()}} correctly returns {{1}}, but {{.collect()}} still returns an
> empty list.
> *Question:*
> Is Spark caching something? Why does {{.count()}} reflect the update while
> {{.collect()}} does not?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
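
The report asks whether Spark is caching something but does not establish the cause. As one way to probe that question (an assumption to test, not a confirmed diagnosis or fix), the sketch below invalidates Spark's cached state for the table and builds a new DataFrame before collecting. {{spark.catalog.refreshTable}} and {{spark.table}} are existing PySpark calls; the helper name {{collect_fresh}} is hypothetical and is not part of the original report.

```python
from typing import Any, List


def collect_fresh(spark: Any, table_name: str) -> List[Any]:
    """Hypothetical helper: refresh the table, then collect from a new DataFrame.

    `spark` is assumed to be a pyspark.sql.SparkSession (typed as Any so the
    sketch itself does not require pyspark to be importable).
    """
    # Invalidate cached data and metadata that Spark holds for this table.
    spark.catalog.refreshTable(table_name)
    # Resolve the table again instead of reusing a DataFrame that was
    # created before the append.
    return spark.table(table_name).collect()
```

In the reproduction above, comparing {{len(collect_fresh(spark, table_name))}} with {{len(df.collect())}} after the append could help show whether the stale result is tied to the previously created {{df}}. Whether this returns the new row on 3.1.2 has not been verified here.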