Hi Team,

I am trying to display the count of a DF created by running a Spark SQL
query with a CTE pattern.
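
For context, the DF is built roughly like this (the CTE and table names
below are placeholders, not my real query):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cte-count").getOrCreate()

    # Placeholder CTE query; the real one reads a table backed by S3
    modifiedData = spark.sql("""
        WITH base AS (
            SELECT * FROM source_table  -- hypothetical table name
        )
        SELECT * FROM base
    """)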

Everything else works as expected: I was able to write the DF to Postgres
RDS. However, when I try to display the count using a simple count()
action, it fails with the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling o321.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
1301 in stage 35.0 failed 4 times, most recent failure: Lost task 1301.3 in
stage 35.0 (TID 7889, 10.100.6.148, executor 1):
java.io.FileNotFoundException: File not present on S3
It is possible the underlying files have been updated. You can explicitly
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
in SQL or by recreating the Dataset/DataFrame involved.
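
For reference, the explicit invalidation the message suggests would look
something like this (table name is hypothetical; I have not verified that
it fixes the failure):

    # Per the error message: invalidate Spark's cached file listing,
    # or recreate the DataFrame before calling count()
    spark.sql("REFRESH TABLE source_table")  # hypothetical table name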


So, I tried the following instead:

print(modifiedData.repartition(modifiedData.rdd.getNumPartitions()).count())

This DF ends up with 80 partitions, and the count written to the table is
92,665. However, it did not match the count displayed after
repartitioning, which was 91,183.

I am not sure why there is this gap.

Why do the counts not match? And what could be the reason for the simple
count() error above?
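
For completeness, the write that succeeded is a plain JDBC write, roughly
like this (connection details are placeholders):

    # Hypothetical connection details; the real RDS endpoint/creds differ
    modifiedData.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://<rds-endpoint>:5432/<db>") \
        .option("dbtable", "public.modified_data") \
        .option("user", "<user>") \
        .option("password", "<password>") \
        .option("driver", "org.postgresql.Driver") \
        .mode("append") \
        .save()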

Environment:
AWS Glue 1.X
10 workers
Spark 2.4.3

Thanks,
Sid
