Hi Rishi,

There is no version called just "2.4" :) Can you please specify the exact
Spark version you are using? How are you starting the Spark session? And what
is the environment?

I know this issue occurs intermittently on large writes to S3 and has to do
with S3's eventual-consistency semantics. Sometimes just restarting the job
helps.
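
If speculative execution happens to be enabled in your environment, turning it
off can also stop duplicate task attempts racing to create the same output
files, and the v2 file output committer shortens the commit window on S3. A
rough, untested sketch (the two config keys are standard Spark/Hadoop
settings; the app name is just a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-by-date")  # placeholder app name
    # speculation can launch duplicate attempts that write the same files
    .config("spark.speculation", "false")
    # algorithm v2 moves task output to the destination at task commit,
    # shortening the window in which a retry sees half-written files
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

Whether this helps depends on which committer your distribution actually uses
(the EMRFS and S3A committers behave differently), so please treat it as
something to experiment with rather than a definite fix.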


Regards,
Gourav Sengupta

On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> I have a DataFrame of size 2.7 TB (Parquet) which I need to partition by
> date; however, the Spark program below doesn't help and keeps failing with
> a *file already exists* exception.
>
> df = spark.read.parquet(INPUT_PATH)
>
> (df.repartition('date_field')
>    .write
>    .partitionBy('date_field')
>    .mode('overwrite')
>    .parquet(PATH))
>
> I did notice that a couple of tasks failed, and that's probably why Spark
> spun up new attempts which wrote to the same .staging directory?
>
> --
> Regards,
>
> Rishi Shah
>
