[ https://issues.apache.org/jira/browse/SPARK-23006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dhruv sharma updated SPARK-23006:
---------------------------------
Description:
We use S3 for reads and writes across several jobs and are seeing read and write inconsistencies that lead to job failures. Details below:

- WRITE ERRORS:
While writing to S3 we have seen the exception "File already exists". Our understanding is that when an executor dies mid-write, the same task is re-scheduled on another executor, which retries the write; the partially written file then triggers the exception above. We tried tuning the time multiplier so that write tasks are not killed. This works for now, but we are not sure when it might stop working.

- READ AFTER WRITE ERRORS:
One job deletes the S3 data and then writes new data to the same location. S3 does not guarantee immediate deletion. When a second job tries to read the location using [ spark.read.json("bucket/key1/*") ], it fails with a "FILE_NOT_FOUND" exception, because the listing still includes objects whose deletion has not yet propagated.

Is there a way to tune our Spark configuration to avoid these exceptions? Further, what should be kept in mind when reading from and writing to S3?

was:
We use S3 for reads and writes across several jobs and are seeing read and write inconsistencies that lead to job failures. Details below:

- WRITE ERRORS:
While writing to S3 I have seen the exception "File already exists". My understanding is that when an executor dies mid-write, the same task is re-scheduled on another executor, which retries the write; the partially written file then triggers the exception above. We tried tuning the time multiplier so that write tasks are not killed. This works for now, but we are not sure when it might stop working.

- READ AFTER WRITE ERRORS:
One job writes to S3 in overwrite mode: it first deletes the files in the folder and then re-writes them. S3 does not guarantee immediate deletion.
When we then list the objects using spark.read.json("bucket/key1/*"), it throws a "FILE NOT FOUND" exception, because the file has already been deleted. We have also hit this when reading immediately after a new write: since the list operation is not strongly consistent, we get errors like "FILE NOT FOUND" or "KEY NOT FOUND". Is there a way to tune our Spark configuration to avoid these exceptions? Further, what should be kept in mind when reading from and writing to S3?


> S3 sync issue
> -------------
>
>                 Key: SPARK-23006
>                 URL: https://issues.apache.org/jira/browse/SPARK-23006
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Spark Core
>    Affects Versions: 2.1.1
>         Environment: AWS EMR 5.7
>            Reporter: Dhruv sharma
>            Priority: Blocker
>              Labels: aws, emr, hadoop, read, s3, spark, write
>             Fix For: 2.1.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We use S3 for reads and writes across several jobs and are seeing read and
> write inconsistencies that lead to job failures.
> Details below:
> - WRITE ERRORS:
> While writing to S3 we have seen the exception "File already exists". Our
> understanding is that when an executor dies mid-write, the same task is
> re-scheduled on another executor, which retries the write; the partially
> written file then triggers the exception above.
> We tried tuning the time multiplier so that write tasks are not killed.
> This works for now, but we are not sure when it might stop working.
> - READ AFTER WRITE ERRORS
> One job deletes the S3 data and then writes new data to the same location.
> S3 does not guarantee immediate deletion.
> When a second job tries to read the location using
> [ spark.read.json("bucket/key1/*") ], it fails with a "FILE_NOT_FOUND"
> exception, because the listing still includes objects whose deletion has
> not yet propagated.
> Is there a way to tune our Spark configuration to avoid these exceptions?
> Further, what should be kept in mind when reading from and writing to S3?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
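For the read-after-delete failures described above, one common mitigation (a minimal sketch, not something proposed in the issue itself) is to retry the read with exponential backoff until the stale listing entries age out. The helper below is plain Python; the Spark usage shown in the trailing comment is hypothetical, and the `read_fn`/`attempts`/`base_delay` names are illustrative, not from any Spark or Hadoop API:

```python
import time

def read_with_retry(read_fn, attempts=5, base_delay=1.0):
    """Retry a read that can fail transiently while S3's listing
    catches up with a recent delete (eventual consistency).

    Re-raises the last FileNotFoundError if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return read_fn()
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff before the next attempt: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical Spark usage (the `spark` session name is illustrative):
# df = read_with_retry(lambda: spark.read.json("bucket/key1/*"))
```

For the "File already exists" write errors, the usual knobs are disabling speculative execution (`spark.speculation=false`) and, on Hadoop-based deployments, using an S3A committer designed for object stores; both are configuration choices to evaluate for the workload rather than guaranteed fixes.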