[ https://issues.apache.org/jira/browse/SPARK-23006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dhruv sharma updated SPARK-23006:
---------------------------------
Description:
We use S3 for reads and writes across several jobs and are seeing read and write inconsistencies that lead to job failures. Details below:

- WRITE ERRORS:
While writing to S3 we have seen the exception "File already exists". Our understanding is that when an executor dies mid-write, the same task is re-scheduled on another executor, which retries the write; the partially written file then triggers the exception above. We tried tuning the time multiplier so that write tasks are not killed. This works for now, but we are not sure when it might stop working.

- READ AFTER WRITE ERRORS:
One job deletes the S3 data and then writes new data to the same location. S3 does not guarantee immediate deletion. When a second job tries to read the location using [ spark.read.json("bucket/key1/*") ], it fails with a "FILE_NOT_FOUND" exception, because the listing still includes objects whose deletion has not yet propagated.

Is there a way to tune our Spark configuration to avoid these exceptions? Further, what should be kept in mind when reading from and writing to S3?

was:
We use S3 for reads and writes across several jobs and are seeing read and write inconsistencies that lead to job failures. Details below:

- WRITE ERRORS:
While writing to S3 I have seen the exception "File already exists". My understanding is that when an executor dies mid-write, the same task is re-scheduled on another executor, which retries the write; the partially written file then triggers the exception above. We tried tuning the time multiplier so that write tasks are not killed. This works for now, but we are not sure when it might stop working.

- READ AFTER WRITE ERRORS:
One job writes to S3 in overwrite mode: it first deletes the files in the folder and then re-writes them. S3 does not guarantee immediate deletion.
When we then list the objects using spark.read.json("bucket/key1/*"), it throws a "FILE NOT FOUND" exception, because the file has already been deleted. We have also hit this when reading immediately after a new write: since the list operation is not strongly consistent, we get errors like "FILE NOT FOUND" or "KEY NOT FOUND". Is there a way to tune our Spark configuration to avoid these exceptions? Further, what should be kept in mind when reading from and writing to S3?


> S3 sync issue
> -------------
>
>                 Key: SPARK-23006
>                 URL: https://issues.apache.org/jira/browse/SPARK-23006
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Spark Core
>    Affects Versions: 2.1.1
>         Environment: AWS EMR 5.7
>            Reporter: Dhruv sharma
>            Priority: Blocker
>              Labels: aws, emr, hadoop, read, s3, spark, write
>             Fix For: 2.1.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We use S3 for reads and writes across several jobs and are seeing read and
> write inconsistencies that lead to job failures.
> Details below:
> - WRITE ERRORS:
> While writing to S3 we have seen the exception "File already exists". Our
> understanding is that when an executor dies mid-write, the same task is
> re-scheduled on another executor, which retries the write; the partially
> written file then triggers the exception above.
> We tried tuning the time multiplier so that write tasks are not killed.
> This works for now, but we are not sure when it might stop working.
> - READ AFTER WRITE ERRORS
> One job deletes the S3 data and then writes new data to the same location.
> S3 does not guarantee immediate deletion.
> When a second job tries to read the location using
> [ spark.read.json("bucket/key1/*") ], it fails with a "FILE_NOT_FOUND"
> exception, because the listing still includes objects whose deletion has
> not yet propagated.
> Is there a way to tune our Spark configuration to avoid these exceptions?
> Further, what should be kept in mind when reading from and writing to S3?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
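For the read-after-delete failures described above, one common mitigation (a minimal sketch, not something proposed in the issue itself) is to retry the read with exponential backoff until the stale listing entries age out. The helper below is plain Python; the Spark usage shown in the trailing comment is hypothetical, and the `read_fn`/`attempts`/`base_delay` names are illustrative, not from any Spark or Hadoop API:

```python
import time

def read_with_retry(read_fn, attempts=5, base_delay=1.0):
    """Retry a read that can fail transiently while S3's listing
    catches up with a recent delete (eventual consistency).

    Re-raises the last FileNotFoundError if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return read_fn()
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff before the next attempt: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical Spark usage (the `spark` session name is illustrative):
# df = read_with_retry(lambda: spark.read.json("bucket/key1/*"))
```

For the "File already exists" write errors, the usual knobs are disabling speculative execution (`spark.speculation=false`) and, on Hadoop-based deployments, using an S3A committer designed for object stores; both are configuration choices to evaluate for the workload rather than guaranteed fixes.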