Hi Team, I am trying to display the count of a DataFrame created by running a Spark SQL query that uses a CTE pattern.
Everything works as expected up to a point: I was able to write the DataFrame to Postgres RDS. However, when I try to display the count using a simple `count()` action, it fails with the error below:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o321.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1301 in stage 35.0 failed 4 times, most recent failure: Lost task 1301.3 in stage 35.0 (TID 7889, 10.100.6.148, executor 1): java.io.FileNotFoundException: File not present on S3
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
```

So I tried something like this:

```python
print(modifiedData.repartition(modifiedData.rdd.getNumPartitions()).count())
```

This DataFrame ends up with 80 partitions, and the count written to the table is 92,665. However, that does not match the count displayed after repartitioning, which was 91,183.

Why is there a gap between the two counts? And what could be the reason the simple `count()` fails with that error?

Environment:
- AWS Glue 1.X, 10 workers
- Spark 2.4.3

Thanks,
Sid