Thanks for your prompt reply Gourav. I am using Spark 2.4.0 (Cloudera distribution). The job consistently threw this error, so I narrowed down the dataset by adding a date filter (date range: 2018-01-01 to 2018-06-30). However, it's still throwing the same error!
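For reference, the filter I added looks roughly like this (only a sketch - 'date_field' and INPUT_PATH are the same names as in the original snippet quoted below, and the real script may differ slightly):

from pyspark.sql import functions as F

# 'spark' is the existing SparkSession for the job
df = spark.read.parquet(INPUT_PATH)

# keep only Jan-Jun 2018 to shrink the job while debugging
df = df.filter(F.col('date_field').between('2018-01-01', '2018-06-30'))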
*command*: spark2-submit --master yarn --deploy-mode client --executor-memory 15G --executor-cores 5 samplerestage.py

cluster: 4 nodes, 32 cores each, 256GB RAM

This is the only job running, with 20 executors... I would really like to know the best practice around creating a partitioned table using PySpark - every time I need to partition a huge dataset, I run into such issues. Appreciate your help!

On Wed, Jul 31, 2019 at 10:58 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Rishi,
>
> There is no such version as just "2.4" :) - can you please specify the
> exact Spark version you are using? How are you starting the Spark session?
> And what is the environment?
>
> I know this issue occurs intermittently over large writes in S3 and has to
> do with S3 eventual consistency issues. Just restarting the job sometimes
> helps.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah <rishishah.s...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have a dataframe of size 2.7T (parquet) which I need to partition by
>> date, however the Spark program below doesn't help - it keeps failing due
>> to a *file already exists exception*.
>>
>> df = spark.read.parquet(INPUT_PATH)
>>
>> df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)
>>
>> I did notice that a couple of tasks failed, and probably that's why it
>> tried spinning up new ones which write to the same .staging directory?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>

--
Regards,

Rishi Shah
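P.S. Would something along these lines be a reasonable pattern? This is only a sketch of what I am considering, not a confirmed fix - disabling speculation and capping records per file are guesses at avoiding duplicate writes into the .staging directory, and OUTPUT_PATH is a placeholder for my real output path:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("samplerestage")
    # guess: stop speculative/retried tasks from racing to write the same
    # staging files (only matters if speculation is enabled cluster-wide)
    .config("spark.speculation", "false")
    # guess: split each date partition into multiple reasonably sized files
    # instead of one huge file per partition
    .config("spark.sql.files.maxRecordsPerFile", 5000000)
    .getOrCreate()
)

df = spark.read.parquet(INPUT_PATH)

(df.repartition("date_field")
   .write
   .partitionBy("date_field")
   .mode("overwrite")
   .parquet(OUTPUT_PATH))

If the cluster enables speculation by default, that alone might explain multiple tasks writing the same staging file - but that is just a hunch on my part.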