Is your Spark job batch or streaming?
________________________________
From: Sandeep Vinayak <vnayak...@gmail.com>
Sent: Tuesday, October 18, 2022 19:48
To: dev@spark.apache.org <dev@spark.apache.org>
Subject: Missing data in spark output


Hello Everyone,

We have recently been observing intermittent data loss in Spark jobs that write 
output to GCS (Google Cloud Storage). Whenever rows are missing, the output also 
contains duplicate rows. Re-running the job produces output with no duplicate or 
missing rows. Since this is hard to debug, we are first trying to understand the 
potential root cause in theory. Could this be a GCS-specific issue, where GCS 
does not handle consistency well? Any tips would be very helpful.

Thanks,
