[ https://issues.apache.org/jira/browse/SPARK-23006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-23006.
-------------------------------
    Resolution: Invalid

> S3 sync issue
> -------------
>
>                 Key: SPARK-23006
>                 URL: https://issues.apache.org/jira/browse/SPARK-23006
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Spark Core
>    Affects Versions: 2.1.1
>         Environment: AWS EMR 5.7
>            Reporter: Dhruv sharma
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We use S3 to read and write data in our various jobs, and we are seeing
> read and write inconsistencies that lead to job failures. Details below:
> - WRITE ERRORS:
> While writing to S3 we have seen the exception "File already exists".
> Our understanding is that if an executor dies while writing to S3, the
> same task is delegated to another executor, which then retries the write;
> because the file was already partially written, the retry fails with the
> above exception. We tried tuning the time multiplier so that write tasks
> are not killed. That works, but it is not a robust solution (see the
> first sketch after this message).
> - READ AFTER WRITE ERRORS:
> One of the jobs deletes the S3 data and then writes new data to the same
> location. S3 does not guarantee immediate deletion, so when a second job
> tries to read the location using [ spark.read.json("bucket/key1/*") ],
> it gets a "FILE_NOT_FOUND" exception: the listing still includes the
> deleted objects because the deletion has not yet propagated (see the
> second sketch after this message).
> Is there a way to tune our Spark configuration to avoid such exceptions?
> More generally, what should be kept in mind when reading from and writing
> to S3?
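On the "File already exists" write failures: a minimal Scala sketch of one
common mitigation, not a fix endorsed in this ticket. Speculative execution
launches duplicate copies of slow tasks, and on S3 a duplicate attempt can
collide with a partially written file, so keeping it disabled (it is off by
default in Spark, but cluster configs sometimes enable it) reduces duplicate
write attempts; SaveMode.Overwrite lets a rerun replace leftovers from a
failed attempt. The bucket and prefix names are hypothetical.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("s3-write-sketch")
      // Keep speculation off: a speculative duplicate of a write task can
      // race the original and hit "File already exists" on S3.
      .config("spark.speculation", "false")
      .getOrCreate()

    // Hypothetical input location.
    val df = spark.read.json("s3://my-bucket/input/")

    df.write
      .mode(SaveMode.Overwrite)  // replace partial output from a failed run
      .json("s3://my-bucket/output/")

This narrows the window for retry collisions but does not make S3 writes
transactional; a task that dies mid-write can still leave partial objects
behind for the committer to clean up.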
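On the read-after-delete failures: at the time of this ticket, S3 listings
were eventually consistent, so a delete-then-rewrite of the same prefix
could be listed in a stale state (EMRFS "consistent view" and S3Guard
existed for exactly this reason). A minimal sketch of a pattern that
sidesteps the problem, assuming the jobs can agree on how run prefixes are
handed off: write each run under a fresh prefix and never delete before
reading. All names are hypothetical.

    import java.time.Instant
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-read-sketch")
      .getOrCreate()

    // Each run writes under a brand-new prefix; nothing is deleted first,
    // so readers never list a prefix whose deletes are still propagating.
    val runPrefix = s"s3://my-bucket/key1/run-${Instant.now().toEpochMilli}/"

    spark.read.json("s3://my-bucket/staging/")
      .write
      .json(runPrefix)

    // The downstream job reads the exact prefix it was handed (via a job
    // parameter or a small "latest" pointer object), not a wildcard over
    // a location that was just deleted:
    val df = spark.read.json(runPrefix)
    df.show()

Old run prefixes can then be cleaned up asynchronously, for example by an
S3 lifecycle rule, once no reader depends on them.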