Thanks for your prompt reply, Gourav. I am using Spark 2.4.0 (Cloudera
distribution). The job consistently threw this error, so I narrowed down
the dataset by adding a date filter (date range: 2018-01-01 to 2018-06-30).
However, it is still throwing the same error!
*command*: spark2-submit --master yarn --deploy-mode client
--executor-memory 15G --executor-cores 5 samplerestage.py
*cluster*: 4 nodes, 32 cores and 256GB RAM each
This is the only job running, with 20 executors.
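In case it helps, I was planning to retry with a few extra confs added to the
same command - these are guesses based on the staging-directory error, not
settings I have verified fix it:

spark2-submit --master yarn --deploy-mode client \
  --executor-memory 15G --executor-cores 5 --num-executors 20 \
  --conf spark.speculation=false \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  samplerestage.py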
I would really like to know the best practice for creating a partitioned
table using PySpark - every time I need to partition a huge dataset, I run
into issues like this. Appreciate your help!
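For context, the pattern I have in mind is roughly the below - the
partitionOverwriteMode and speculation settings are just things I intend to
try based on the staging-directory error, not settings I have confirmed help,
and INPUT_PATH / OUTPUT_PATH / date_field are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-write")
         # Spark 2.3+: overwrite only the partitions being written, not the whole table
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         # avoid speculative duplicate attempts racing on the same .staging files
         .config("spark.speculation", "false")
         .getOrCreate())

# INPUT_PATH / OUTPUT_PATH stand in for the actual locations
df = spark.read.parquet(INPUT_PATH)

(df.repartition("date_field")          # group each date's rows into the same task
   .write
   .partitionBy("date_field")
   .mode("overwrite")
   .parquet(OUTPUT_PATH))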
On Wed, Jul 31, 2019 at 10:58 PM Gourav Sengupta
wrote:
> Hi Rishi,
>
> there is no such version as 2.4 :) - can you please specify the exact SPARK
> version you are using? How are you starting the SPARK session? And what is
> the environment?
>
> I know this issue occurs intermittently over large writes in S3 and has to
> do with S3 eventual consistency issues. Just restarting the job sometimes
> helps.
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah
> wrote:
>
>> Hi All,
>>
>> I have a dataframe of size 2.7T (parquet) which I need to partition by
>> date; however, the Spark program below doesn't help - it keeps failing due to a *file
>> already exists exception..*
>>
>> df = spark.read.parquet(INPUT_PATH)
>>
>> df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)
>>
>> I did notice that a couple of tasks failed, and probably that's why it tried
>> spinning up new ones which wrote to the same .staging directory?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
--
Regards,
Rishi Shah