Re: [Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
Thanks for your prompt reply Gourav. I am using Spark 2.4.0 (Cloudera
distribution). The job consistently threw this error, so I narrowed down
the dataset by adding a date filter (date range: 2018-01-01 to 2018-06-30).
However, it's still throwing the same error!

*command*: spark2-submit --master yarn --deploy-mode client
--executor-memory 15G --executor-cores 5 samplerestage.py
*cluster*: 4 nodes, 32 cores and 256 GB RAM each

This is the only job running, with 20 executors...

I would really like to know the best practice around creating partitioned
tables using PySpark - every time I need to partition a huge dataset, I run
into issues like this. Appreciate your help!
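
For reference, a rough sketch of the kind of write I have in mind is below
(just a sketch: it assumes Spark 2.3+ for the dynamic partition overwrite
setting, and OUTPUT_PATH and the record cap are placeholders I made up):

# only rewrite the date partitions present in the incoming data
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# cap records per output file so a single date doesn't produce one huge file
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5000000)

df = spark.read.parquet(INPUT_PATH)
(df.repartition("date_field")        # send all rows for a given date to the same shuffle partition
   .write
   .partitionBy("date_field")
   .mode("overwrite")
   .parquet(OUTPUT_PATH))

Is that a reasonable pattern, or is there a better approach?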


On Wed, Jul 31, 2019 at 10:58 PM Gourav Sengupta wrote:

> Hi Rishi,
>
> there is no such version as 2.4 :), can you please specify the exact SPARK
> version you are using? How are you starting the SPARK session? And what is
> the environment?
>
> I know this issue occurs intermittently over large writes in S3 and has to
> do with S3 eventual consistency issues. Just restarting the job sometimes
> helps.
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah wrote:
>
>> Hi All,
>>
>> I have a dataframe of size 2.7 TB (Parquet) which I need to partition by
>> date; however, the Spark program below doesn't help - it keeps failing with a
>> *file already exists exception*.
>>
>> df = spark.read.parquet(INPUT_PATH)
>>
>> df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)
>>
>> I did notice that a couple of tasks failed, and that's probably why it
>> tried spinning up new ones that write to the same .staging directory?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>

-- 
Regards,

Rishi Shah


Re: [Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Gourav Sengupta
Hi Rishi,

there is no such version as 2.4 :), can you please specify the exact SPARK
version you are using? How are you starting the SPARK session? And what is
the environment?

I know this issue occurs intermittently over large writes in S3 and has to
do with S3 eventual consistency issues. Just restarting the job sometimes
helps.


Regards,
Gourav Sengupta

On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah wrote:

> Hi All,
>
> I have a dataframe of size 2.7 TB (Parquet) which I need to partition by
> date; however, the Spark program below doesn't help - it keeps failing with a
> *file already exists exception*.
>
> df = spark.read.parquet(INPUT_PATH)
>
> df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)
>
> I did notice that a couple of tasks failed, and that's probably why it
> tried spinning up new ones that write to the same .staging directory?
>
> --
> Regards,
>
> Rishi Shah
>


[Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
Hi All,

I have a dataframe of size 2.7 TB (Parquet) which I need to partition by
date; however, the Spark program below doesn't help - it keeps failing with a
*file already exists exception*.

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I did notice that a couple of tasks failed, and that's probably why it
tried spinning up new ones that write to the same .staging directory?
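
In case it's relevant, one thing I'm wondering about (purely a guess on my
part) is whether duplicate task attempts are the culprit, e.g. from
speculative execution. A hypothetical way to rule that out would be to
disable it explicitly at submit time (spark.speculation is a standard Spark
conf; the other flags just mirror how I submit the job):

spark2-submit --master yarn --deploy-mode client \
  --conf spark.speculation=false \
  --executor-memory 15G --executor-cores 5 \
  samplerestage.py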

-- 
Regards,

Rishi Shah