Is your Spark job batch or streaming?

________________________________
From: Sandeep Vinayak <vnayak...@gmail.com>
Sent: Tuesday, October 18, 2022 19:48
To: dev@spark.apache.org <dev@spark.apache.org>
Subject: Missing data in spark output
Hello Everyone,

We have recently been observing intermittent data loss in Spark output written to GCS (Google Cloud Storage). Whenever rows are missing, the output also contains duplicate rows. Re-running the job produces output with no duplicate or missing rows.

Since this is hard to debug, we are first trying to understand the potential root cause in theory. Could this be a GCS-specific issue, where GCS is not handling consistency well?

Any tips would be super helpful.

Thanks,