Re: Missing data in spark output

2022-10-25 Thread Steve Loughran
v1 on GCS isn't safe either, as promotion from task attempt to
successful task is a directory rename: fast and atomic on HDFS, O(files) and
non-atomic on GCS.

If I can get that Hadoop 3.3.5 RC out soon, the manifest committer will be
there to test: https://issues.apache.org/jira/browse/MAPREDUCE-7341

Until then, as Chris says, turn off speculative execution.
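
To illustrate the directory-rename point at the top of this message, here is a toy model of an object store in which a "directory rename" is really one copy-and-delete per object. It is only a sketch under stated assumptions: ToyObjectStore, rename_dir and the fail_after knob are hypothetical names invented for this example, the paths are simplified stand-ins for the committer's temporary layout, and none of it is the GCS connector's actual code.

```python
class ToyObjectStore:
    """Flat key -> bytes map standing in for a store with no real directories."""

    def __init__(self):
        self.objects = {}
        self.moves = 0  # number of per-object copy+delete operations performed

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_dir(self, src, dst, fail_after=None):
        """Emulated directory rename: O(files), one object at a time.

        fail_after is a hypothetical knob that aborts the loop after that
        many objects, mimicking a process dying in the middle of the rename.
        """
        for i, key in enumerate(self.list_prefix(src)):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated failure during rename")
            self.objects[dst + key[len(src):]] = self.objects.pop(key)
            self.moves += 1


store = ToyObjectStore()
# Simplified, illustrative paths for a task attempt and its promoted task dir.
attempt_dir = "out/_temporary/0/_temporary/attempt_0001/"
task_dir = "out/_temporary/0/task_0001/"
for n in range(4):
    store.put(f"{attempt_dir}part-{n:05d}", b"rows")

try:
    store.rename_dir(attempt_dir, task_dir, fail_after=2)
except RuntimeError:
    pass  # the "rename" stopped half way; nothing rolls it back

committed = store.list_prefix(task_dir)
stranded = store.list_prefix(attempt_dir)
print(len(committed), "files promoted,", len(stranded), "left behind")
# prints: 2 files promoted, 2 left behind
```

On a filesystem with an atomic rename, the destination would hold either all four files or none. In this toy model an interrupted rename leaves a partially promoted task directory, which is the kind of state a second or speculative attempt can then collide with.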



Re: Missing data in spark output

2022-10-21 Thread Chris Nauroth
Some users have observed issues like what you're describing related to the
job commit algorithm, which is controlled by configuration
property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.
Hadoop's default value for this setting is 2. You can find a description of
the algorithms in Hadoop's configuration documentation:

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Algorithm version 2 is faster, because the final task output file renames
can be issued in parallel by individual tasks. Unfortunately, there have
been reports of it causing side effects like what you described, especially
if there are a lot of task attempt retries or speculative execution
(configuration property spark.speculation set to true instead of the
default false). You could try switching to algorithm version 1. The
drawback is that it's slower, because the final output renames are executed
single-threaded at the end of the job. The performance impact is more
noticeable for jobs with many tasks, and the effect is amplified when using
cloud storage as opposed to HDFS running in the same network.

If you are using speculative execution, then you could also potentially try
turning that off.

Chris Nauroth
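
A minimal sketch of how the two settings discussed above could be passed to a job. The property names are the ones given in the message, and `--conf key=value` is the standard spark-submit form. The helper names committer_workaround_conf and to_submit_args are hypothetical, made up for this example. As the later reply in this thread notes, algorithm version 1 does not by itself make commits safe on GCS, so treat this as something to try rather than a guaranteed fix.

```python
def committer_workaround_conf(algorithm_version=1, speculation=False):
    """Return the two Spark properties discussed above as a dict of strings."""
    if algorithm_version not in (1, 2):
        raise ValueError("algorithm_version must be 1 or 2")
    return {
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version":
            str(algorithm_version),
        "spark.speculation": str(speculation).lower(),
    }


def to_submit_args(conf):
    """Turn a property dict into repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


command = ["spark-submit"] + to_submit_args(committer_workaround_conf())
print(" ".join(command))
# prints: spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 --conf spark.speculation=false
```

The same key/value pairs can be placed in spark-defaults.conf or set on the session builder instead of the command line. The application jar or script and any other arguments still need to be appended to the command.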




Re: Missing data in spark output

2022-10-19 Thread Martin Andersson
Is your spark job batch or streaming?

From: Sandeep Vinayak 
Sent: Tuesday, October 18, 2022 19:48
To: dev@spark.apache.org 
Subject: Missing data in spark output



Re: Missing data in spark output

2022-10-18 Thread Emil Ejbyfeldt

Hi,

We have observed similar behavior in older versions of Spark, but we
are currently using 3.3.0, where we have not seen such issues.

Which versions of Spark and Hadoop are you using?







Missing data in spark output

2022-10-18 Thread Sandeep Vinayak
Hello Everyone,

We have recently been observing intermittent data loss in Spark jobs that
write output to GCS (Google Cloud Storage). When there are missing rows, they
are accompanied by duplicate rows. A re-run of the job doesn't have any
duplicate or missing rows. Since it's hard to debug, we are first trying to
understand the potential theoretical root cause of this issue. Could this be
a GCS-specific issue where GCS might not be handling consistency well? Any
tips would be super helpful.

Thanks,